## Useful Resources
- Matt Ashby Crime Mapping course: https://github.com/mpjashby/crimemapping/
- Spatial Modelling for Data Scientists: https://gdsl-ul.github.io/san/
- R for Data Science: https://r4ds.had.co.nz/index.html
- Geocomputation with R: https://geocompr.robinlovelace.net/
In this notebook, I’ll be predicting crime trends in London by MSOA, and looking at where COVID has had the greatest impact. I’ll then try to find correlates.
# Data manipulation, transformation and visualisation
library(tidyverse)
# Nice tables
library(kableExtra)
# Simple features (a standardised way to encode vector data ie. points, lines, polygons)
library(sf)
# Spatial objects conversion
library(sp)
# Thematic maps
library(tmap)
# Colour palettes
library(RColorBrewer)
# More colour palettes
library(viridis)
library(raster) # raster data
library(rgdal) # input/output, projections
library(rgeos) # geometry ops
library(spdep) # spatial dependence
library(Metrics)
library(caret)
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘caret’
The following objects are masked from ‘package:Metrics’:
precision, recall
The following object is masked from ‘package:purrr’:
lift
One of the first things I notice is that while Python code is generally quite careful about imports, here we globally import everything…which is nice, but I’m also not quite clear which functions are coming from which libraries.
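One way to make the origin of a function explicit, closer to Python’s import style, is the package::function syntax. The sketch below sticks to the stats package that ships with R, purely as an illustration; the same pattern should apply to dplyr, sf and the rest.

```r
# Call a function through its namespace, without relying on library()
middle <- stats::median(c(1, 5, 9))

# Ask R which package a bare function name currently resolves to
origin <- environmentName(environment(median))

print(middle)
print(origin)
```

Prefixing only the ambiguous calls (for example the ones caret masks above) is probably enough to keep a notebook readable.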
Now, let’s import all our crime data. Let’s start with one dataframe before figuring out how to automate and concatenate. Notice that R paths use forward slashes, even on Windows, where the native separator is a backslash.
test_df <- read.csv("crimes/2018-01/2018-01-metropolitan-street.csv")
test_df
NA
Let’s look at all the unique crime types. Notice how R has a function similar to the “unique” method in Pandas, and how we select a specific column with the same bracket syntax.
unique(test_df["Crime.type"])
To avoid this getting particularly computationally intensive, let’s write a function to pull out robberies and burglaries, and assign them a specific MSOA. Then we can iterate over all our months and get monthly counts for each offence type.
subset_df <- filter(test_df, Crime.type=="Burglary" | Crime.type=="Robbery")
subset_df
Weirdly, here I don’t need quotation marks for my column name. That’s dplyr’s “data masking”: filter() evaluates bare names as columns of the dataframe. Either way, it’s easy to select a subset of rows by column value, and use logical comparators.
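A small sketch of how the unquoted column name works, using a made-up three-row dataframe rather than the real crime file: filter() looks up bare names as columns first, %in% is a tidier way to write the “or” condition, and the .data pronoun is there if you’d rather give the column name as a string.

```r
library(dplyr)

# Toy data standing in for the real crime file (values are invented)
toy_df <- data.frame(
  Crime.type = c("Burglary", "Robbery", "Drugs"),
  Month = "2018-01"
)

# Bare column name: dplyr evaluates it inside the dataframe
by_name <- filter(toy_df, Crime.type %in% c("Burglary", "Robbery"))

# Same filter with the column name given as a string
by_string <- filter(toy_df, .data[["Crime.type"]] %in% c("Burglary", "Robbery"))

print(by_name)
```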
With that in mind, let’s now read our MSOA borders and assign these to an MSOA. I’ve set the OSGB CRS code as it doesn’t seem to automatically assign it.
lsoa_borders <- st_read("msoa_borders/MSOA_2011_London_gen_MHW.tab", crs=27700)
Reading layer `MSOA_2011_London_gen_MHW' from data source `C:\Users\andre\Dropbox\Data Projects\Covid_crime_shift\msoa_borders\MSOA_2011_London_gen_MHW.tab' using driver `MapInfo File'
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
lsoa_borders
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
First 10 features:
MSOA11CD MSOA11NM LAD11CD LAD11NM RGN11CD RGN11NM UsualRes HholdRes ComEstRes PopDen Hholds AvHholdSz
1 E02000001 City of London 001 E09000001 City of London E12000007 London 7375 7187 188 25.5 4385 1.6
2 E02000002 Barking and Dagenham 001 E09000002 Barking and Dagenham E12000007 London 6775 6724 51 31.3 2713 2.5
3 E02000003 Barking and Dagenham 002 E09000002 Barking and Dagenham E12000007 London 10045 10033 12 46.9 3834 2.6
4 E02000004 Barking and Dagenham 003 E09000002 Barking and Dagenham E12000007 London 6182 5937 245 24.8 2318 2.6
5 E02000005 Barking and Dagenham 004 E09000002 Barking and Dagenham E12000007 London 8562 8562 0 72.1 3183 2.7
6 E02000007 Barking and Dagenham 006 E09000002 Barking and Dagenham E12000007 London 8791 8672 119 50.6 3441 2.5
7 E02000008 Barking and Dagenham 007 E09000002 Barking and Dagenham E12000007 London 11569 11564 5 81.5 4591 2.5
8 E02000009 Barking and Dagenham 008 E09000002 Barking and Dagenham E12000007 London 8395 8376 19 87.4 3212 2.6
9 E02000010 Barking and Dagenham 009 E09000002 Barking and Dagenham E12000007 London 8615 8615 0 76.8 3292 2.6
10 E02000011 Barking and Dagenham 010 E09000002 Barking and Dagenham E12000007 London 6187 6086 101 38.8 2289 2.7
geometry
1 MULTIPOLYGON (((532135.1 18...
2 MULTIPOLYGON (((548881.6 19...
3 MULTIPOLYGON (((549102.4 18...
4 MULTIPOLYGON (((551550 1873...
5 MULTIPOLYGON (((549099.6 18...
6 MULTIPOLYGON (((549819.9 18...
7 MULTIPOLYGON (((548171.4 18...
8 MULTIPOLYGON (((546855 1863...
9 MULTIPOLYGON (((549618.8 18...
10 MULTIPOLYGON (((550244.1 18...
Notice that when you’re reading a frame that isn’t “tidy”, it’s messy as hell - this is a geodataframe. Makes me miss the elegance of Geopandas somewhat. Still, it’s easy to import, and you get pretty plots. Given MOPAC data is in national grid, we’re also going to have to re-project this.
It’s also very easy to plot.
plot(lsoa_borders)
plotting the first 9 out of 12 attributes; use max.plot = 12 to plot all
No idea why it wants that much white space though….
Now, let’s reproject and do a spatial join to assign all of my crimes to an MSOA. Given we’re working on London, let’s convert everything to the British National Grid. We’ll start by changing my crime dataframe, which currently has latitudes and longitudes as plain numbers, into a spatial dataframe with coordinates.
'subset_spatial <- st_as_sf(subset_df, coords = c("Longitude", "Latitude"),
crs = 4326, remove = FALSE)
subset_spatial'
[1] "subset_spatial <- st_as_sf(subset_df, coords = c(\"Longitude\", \"Latitude\"), \n crs = 4326, remove = FALSE)\n\nsubset_spatial"
Ah, we have missing values in the coordinates, which st_as_sf won’t accept. Time to learn how to drop those.
In Pandas, we have easy functions to “dropna” and “isna” - I’m hoping to quickly find equivalents. My favourite Python approach to this is df.isna().sum(), counting how many “true” values you have when filtering like that. Can we duplicate that process?
sum(is.na(subset_df["Longitude"]))
[1] 82
sum(is.na(subset_df["Latitude"]))
[1] 82
We can! Glorious. We have 82 missing coordinates. Let’s drop all those rows.
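The Pandas df.isna().sum() idiom, which counts missing values for every column at once, maps onto colSums(is.na(df)) in base R. A minimal sketch on invented data:

```r
# Toy dataframe with a few missing coordinates (values are invented)
toy_df <- data.frame(
  Longitude = c(0.140035, NA, 0.127794, NA),
  Latitude = c(51.58911, NA, 51.58419, 51.57850),
  Crime.type = c("Burglary", "Robbery", "Burglary", "Robbery")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy_df))

print(na_counts)
```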
clean_df <- subset_df[!rowSums(is.na(subset_df["Longitude"])), ]
clean_df
Annoyingly, I couldn’t find a “subset” argument like the one Pandas has, hence this slightly painful fudge. (It turns out tidyr’s drop_na does accept column names, so drop_na(subset_df, Longitude, Latitude) would do the same job.)
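For dropping rows that are missing values in particular columns only, two options I’m fairly confident about are complete.cases in base R and tidyr’s drop_na with column names. The sketch below uses an invented dataframe with an all-NA Context column, to show why restricting the check to the coordinates matters.

```r
library(tidyr)

# Toy data: Context is always NA, as in the real crime file (values are invented)
toy_df <- data.frame(
  Longitude = c(0.140035, NA, 0.127794),
  Latitude = c(51.58911, 51.58499, NA),
  Context = NA
)

# Base R: keep rows where both coordinate columns are present
clean_base <- toy_df[complete.cases(toy_df[c("Longitude", "Latitude")]), ]

# tidyr: only the listed columns are checked for NA
clean_tidyr <- drop_na(toy_df, Longitude, Latitude)

# Without column names, every row goes because Context is always NA
clean_all <- drop_na(toy_df)

print(clean_tidyr)
```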
We should now be able to form our spatial df.
subset_spatial <- st_as_sf(clean_df, coords = c("Longitude", "Latitude"),
crs = 4326, remove = FALSE)
subset_spatial
Simple feature collection with 10419 features and 12 fields
geometry type: POINT
dimension: XY
bbox: xmin: -0.492381 ymin: 51.28683 xmax: 0.273434 ymax: 51.68564
geographic CRS: WGS 84
First 10 features:
Crime.ID Month Reported.by Falls.within Longitude Latitude
1 628e0d673aa1b6a70479342a64b02884499df85b18dcd63cc9bff3cff9f704bc 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
2 f8e9db16dca534a83493198a838567aa5adc9dd56496edc2fff5bb4c62b8303e 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
3 cc34822074b130f141f16d02fdb2d500c86e22ae18324b43a3231b381af3f45c 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135554 51.58499
4 10de581c3cd0a8c9b970824cd7589d13148d63a70b3115d95ef6c24dc0bd2c3b 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
5 50ad5d2dfea24afec9e17218db62b3d29786775db1060634ae7d4a6e7cafc3ff 2018-01 Metropolitan Police Service Metropolitan Police Service 0.127794 51.58419
6 95abc6eb0b755c9250d19bbe0062fcd4a509b701964d89667401c9dc96ca257d 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
7 035cc894d732addb5009148d8e163e6360094cfe451f621348f1c0419b9cbc77 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
8 495cac920dcf9e0e4927074e8ac307f17d340f01c69e434c4a3721df017cd342 2018-01 Metropolitan Police Service Metropolitan Police Service 0.139479 51.57974
9 48234e70cbc22265ee7968da92df1ca72f83b45414cce486ec2203daa4e59fa2 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135119 51.57849
10 3f4ba20780987c37816ff34fd0f7760cf503b2e648777f84ee0104432cb01d66 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140452 51.58110
Location LSOA.code LSOA.name Crime.type Last.outcome.category Context geometry
1 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Offender sent to prison NA POINT (0.140035 51.58911)
2 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Investigation complete; no suspect identified NA POINT (0.140035 51.58911)
3 On or near Rose Lane E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (0.135554 51.58499)
4 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (0.140035 51.58911)
5 On or near Hope Close E01000028 Barking and Dagenham 001B Burglary Status update unavailable NA POINT (0.127794 51.58419)
6 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Investigation complete; no suspect identified NA POINT (0.138439 51.5785)
7 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (0.138439 51.5785)
8 On or near Yew Tree Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (0.139479 51.57974)
9 On or near Portland Close E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (0.135119 51.57849)
10 On or near Pedestrian Subway E01000030 Barking and Dagenham 001D Robbery Investigation complete; no suspect identified NA POINT (0.140452 51.5811)
plot(subset_spatial)
plotting the first 9 out of 12 attributes; use max.plot = 12 to plot all
Success! That looks faintly promising. Now, let’s figure out how to re-project.
latlong = "+init=epsg:4326"
ukgrid = "+init=epsg:27700"
subset_osgb <- st_transform(subset_spatial, ukgrid)
GDAL Message 1: +init=epsg:XXXX syntax is deprecated. It might return a CRS with a non-EPSG compliant axis order.
subset_osgb
Simple feature collection with 10419 features and 12 fields
geometry type: POINT
dimension: XY
bbox: xmin: 504499 ymin: 155908 xmax: 557677 ymax: 200168
projected CRS: OSGB 1936 / British National Grid
First 10 features:
Crime.ID Month Reported.by Falls.within Longitude Latitude
1 628e0d673aa1b6a70479342a64b02884499df85b18dcd63cc9bff3cff9f704bc 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
2 f8e9db16dca534a83493198a838567aa5adc9dd56496edc2fff5bb4c62b8303e 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
3 cc34822074b130f141f16d02fdb2d500c86e22ae18324b43a3231b381af3f45c 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135554 51.58499
4 10de581c3cd0a8c9b970824cd7589d13148d63a70b3115d95ef6c24dc0bd2c3b 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
5 50ad5d2dfea24afec9e17218db62b3d29786775db1060634ae7d4a6e7cafc3ff 2018-01 Metropolitan Police Service Metropolitan Police Service 0.127794 51.58419
6 95abc6eb0b755c9250d19bbe0062fcd4a509b701964d89667401c9dc96ca257d 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
7 035cc894d732addb5009148d8e163e6360094cfe451f621348f1c0419b9cbc77 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
8 495cac920dcf9e0e4927074e8ac307f17d340f01c69e434c4a3721df017cd342 2018-01 Metropolitan Police Service Metropolitan Police Service 0.139479 51.57974
9 48234e70cbc22265ee7968da92df1ca72f83b45414cce486ec2203daa4e59fa2 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135119 51.57849
10 3f4ba20780987c37816ff34fd0f7760cf503b2e648777f84ee0104432cb01d66 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140452 51.58110
Location LSOA.code LSOA.name Crime.type Last.outcome.category Context geometry
1 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Offender sent to prison NA POINT (548349 189976)
2 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Investigation complete; no suspect identified NA POINT (548349 189976)
3 On or near Rose Lane E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (548052 189507.9)
4 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA POINT (548349 189976)
5 On or near Hope Close E01000028 Barking and Dagenham 001B Burglary Status update unavailable NA POINT (547517 189404)
6 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Investigation complete; no suspect identified NA POINT (548273 188793)
7 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (548273 188793)
8 On or near Yew Tree Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (548341 188933)
9 On or near Portland Close E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA POINT (548043 188785)
10 On or near Pedestrian Subway E01000030 Barking and Dagenham 001D Robbery Investigation complete; no suspect identified NA POINT (548404 189086)
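The GDAL message above comes from the “+init=epsg:” strings. As far as I can tell, sf also accepts the bare EPSG number, which avoids the deprecated syntax. A sketch on a single point, taken from the first crime record above:

```r
library(sf)

# One WGS 84 point (longitude, latitude) from the first crime record
pt_wgs84 <- st_sfc(st_point(c(0.140035, 51.58911)), crs = 4326)

# Reproject to British National Grid using the EPSG code directly
pt_osgb <- st_transform(pt_wgs84, 27700)

print(st_coordinates(pt_osgb))
```

The result should land close to the easting and northing printed for that record above, though the last few metres can depend on the PROJ version installed.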
Error messages in R are definitely harder to digest for me so far…I’m hoping that will pass with time. I’m also finding the documentation slightly harder to figure out, with fewer worked examples. Still, so far, so easily translatable! Now, let’s spatial join this up.
crime_with_msoa <- st_join(subset_osgb, lsoa_borders["MSOA11CD"])
crime_with_msoa
Simple feature collection with 10419 features and 13 fields
geometry type: POINT
dimension: XY
bbox: xmin: 504499 ymin: 155908 xmax: 557677 ymax: 200168
projected CRS: OSGB 1936 / British National Grid
First 10 features:
Crime.ID Month Reported.by Falls.within Longitude Latitude
1 628e0d673aa1b6a70479342a64b02884499df85b18dcd63cc9bff3cff9f704bc 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
2 f8e9db16dca534a83493198a838567aa5adc9dd56496edc2fff5bb4c62b8303e 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
3 cc34822074b130f141f16d02fdb2d500c86e22ae18324b43a3231b381af3f45c 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135554 51.58499
4 10de581c3cd0a8c9b970824cd7589d13148d63a70b3115d95ef6c24dc0bd2c3b 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140035 51.58911
5 50ad5d2dfea24afec9e17218db62b3d29786775db1060634ae7d4a6e7cafc3ff 2018-01 Metropolitan Police Service Metropolitan Police Service 0.127794 51.58419
6 95abc6eb0b755c9250d19bbe0062fcd4a509b701964d89667401c9dc96ca257d 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
7 035cc894d732addb5009148d8e163e6360094cfe451f621348f1c0419b9cbc77 2018-01 Metropolitan Police Service Metropolitan Police Service 0.138439 51.57850
8 495cac920dcf9e0e4927074e8ac307f17d340f01c69e434c4a3721df017cd342 2018-01 Metropolitan Police Service Metropolitan Police Service 0.139479 51.57974
9 48234e70cbc22265ee7968da92df1ca72f83b45414cce486ec2203daa4e59fa2 2018-01 Metropolitan Police Service Metropolitan Police Service 0.135119 51.57849
10 3f4ba20780987c37816ff34fd0f7760cf503b2e648777f84ee0104432cb01d66 2018-01 Metropolitan Police Service Metropolitan Police Service 0.140452 51.58110
Location LSOA.code LSOA.name Crime.type Last.outcome.category Context MSOA11CD
1 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Offender sent to prison NA E02000002
2 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Investigation complete; no suspect identified NA E02000002
3 On or near Rose Lane E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA E02000002
4 On or near Beansland Grove E01000027 Barking and Dagenham 001A Burglary Status update unavailable NA E02000002
5 On or near Hope Close E01000028 Barking and Dagenham 001B Burglary Status update unavailable NA E02000002
6 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Investigation complete; no suspect identified NA E02000002
7 On or near Geneva Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA E02000002
8 On or near Yew Tree Gardens E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA E02000002
9 On or near Portland Close E01000029 Barking and Dagenham 001C Burglary Status update unavailable NA E02000002
10 On or near Pedestrian Subway E01000030 Barking and Dagenham 001D Robbery Investigation complete; no suspect identified NA E02000002
geometry
1 POINT (548349 189976)
2 POINT (548349 189976)
3 POINT (548052 189507.9)
4 POINT (548349 189976)
5 POINT (547517 189404)
6 POINT (548273 188793)
7 POINT (548273 188793)
8 POINT (548341 188933)
9 POINT (548043 188785)
10 POINT (548404 189086)
That looks like it worked - it defaults to a left join. Now, let’s group by offence type and MSOA, so as to get a count of robbery and burglary per MSOA for this month. We should then have all the code we need to create our function.
msoa_list<- crime_with_msoa %>%
group_by(MSOA11CD, Crime.type) %>%
summarize(count_by_msoa = n())
`summarise()` has grouped output by 'MSOA11CD'. You can override using the `.groups` argument.
msoa_list
Simple feature collection with 1722 features and 3 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: 504499 ymin: 155908 xmax: 557677 ymax: 200168
projected CRS: OSGB 1936 / British National Grid
It works! We’ll need to fill every missing value with 0, drop the geometry column, and then repeat the process for every month, and then we’re in business.
We’ll have to check for any MSOAs that aren’t present in the monthly counts and, for each one that’s missing, add a 0. I’m going to do this for robbery and burglary independently.
class(msoa_list)
[1] "sf" "grouped_df" "tbl_df" "tbl" "data.frame"
We need to remove the geometry column. Currently, our object is a spatial dataframe (sf) and a tibble (a tidyverse specific dataframe type) which is stopping me from removing the geometry data.
msoa_pivot_tibble <- as_tibble(msoa_list)
msoa_pivot_tibble
class(msoa_pivot_tibble)
[1] "tbl_df" "tbl" "data.frame"
We’ve now removed the sf class, and should be able to drop the last column (geometry). Note that R indexes from 1, so 0:3 selects the same columns as 1:3; the 0 is silently ignored.
msoa_pivot_tibble <- msoa_pivot_tibble[0:3]
msoa_pivot_tibble
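For what it’s worth, sf ships a helper for exactly this step, st_drop_geometry, which returns a plain dataframe. Dropping the geometry before counting also avoids summarise() merging the points in each group, which I believe is why the summary above came back with a mixed GEOMETRY type. A sketch on three invented points:

```r
library(sf)
library(dplyr)

# Three made-up crime points, already in British National Grid
toy_points <- st_as_sf(
  data.frame(
    MSOA11CD = c("E02000002", "E02000002", "E02000003"),
    Crime.type = c("Burglary", "Burglary", "Robbery"),
    x = c(548349, 548052, 549102),
    y = c(189976, 189508, 188000)
  ),
  coords = c("x", "y"), crs = 27700
)

# Drop the geometry first, then count rows per MSOA and crime type
toy_counts <- toy_points %>%
  st_drop_geometry() %>%
  count(MSOA11CD, Crime.type, name = "count_by_msoa")

print(toy_counts)
```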
We now need to fill our missing values. Rather than iterate or similar, I’ll just add an entire df filled with 0s for both crime types, then drop any duplicates - at least, that’s how I’d do it in Python, and shall try to do here!
#creating a df with all msoa names, for robbery and burglary
msoa_zero_df_robbery <- unique(as_tibble(lsoa_borders)["MSOA11CD"])
msoa_zero_df_burglary <- unique(as_tibble(lsoa_borders)["MSOA11CD"])
#adding our crime type column
msoa_zero_df_burglary["Crime.type"] = "Burglary"
msoa_zero_df_robbery["Crime.type"] = "Robbery"
#Creating a "count" column identical to our pivot, and filling it with 0
msoa_zero_df_burglary["count_by_msoa"] = as.numeric(0)
msoa_zero_df_robbery["count_by_msoa"] = as.numeric(0)
msoa_zero_df_robbery
I’m a little worried about the “dbl” class, but let’s ignore that for now. Now, we need to concatenate both, and add them to our MSOA pivot.
duplicate_concat <- rbind(msoa_zero_df_robbery, msoa_zero_df_burglary)
duplicate_concat
It seems to have worked. That said, the fact there is no clear function for concatenation (in contrast to pd.concat in Pandas) surprises me. Now, let’s finally concatenate everything, and remove duplicates. That should form our final monthly df, and we can then combine all our previous steps into a function.
df_with_dups <- rbind(msoa_pivot_tibble, duplicate_concat)
df_with_dups
It’s noticeable how much harder finding documentation is for R than Pandas - while the drop_duplicates function is front and center for any searches in Python, a similar search in R reveals plenty of hacky filters, but the “distinct” function seems to be what I’m actually looking for.
#creating a filter for duplicates columns, which should ignore the first instance
dup_filters <- duplicated(df_with_dups[0:2])
monthly_df <- filter(df_with_dups, !dup_filters)
monthly_df
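An alternative to the “stack zeros, then de-duplicate” approach is to build every MSOA and crime type combination up front and left-join the counts onto it. The sketch below does this in base R with invented codes and counts; tidyr’s complete() is, I believe, the tidyverse route to the same result.

```r
# Invented inputs: three MSOA codes and a partial set of monthly counts
all_codes <- c("E02000001", "E02000002", "E02000003")
toy_counts <- data.frame(
  MSOA11CD = c("E02000001", "E02000001", "E02000003"),
  Crime.type = c("Burglary", "Robbery", "Burglary"),
  count_by_msoa = c(4, 1, 2)
)

# Every combination of MSOA and crime type
full_grid <- expand.grid(
  MSOA11CD = all_codes,
  Crime.type = c("Burglary", "Robbery"),
  stringsAsFactors = FALSE
)

# Left join the counts, then turn the unmatched combinations into zeros
toy_monthly <- merge(full_grid, toy_counts,
                     by = c("MSOA11CD", "Crime.type"), all.x = TRUE)
toy_monthly$count_by_msoa[is.na(toy_monthly$count_by_msoa)] <- 0

print(toy_monthly)
```

Because the grid is built only from the known codes, the row count is exactly the number of MSOAs times the number of crime types.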
That should now be all our values. As a sanity check, let’s make sure we have the right number of rows, using R’s “dim” function (equivalent to shape in Pandas) to check how many unique values we would expect.
dim(unique(as_tibble(lsoa_borders)["MSOA11CD"]))[1] * 2
[1] 1966
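We’ve got 2 extra…a bit weird, but not end of world. My best guess is that these are crimes falling outside every MSOA polygon: the left join leaves them with an NA code, which gives one extra row per crime type. Let’s leave it at that.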
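To see how a left spatial join treats points that fall outside every polygon, here is a toy sketch with one invented square “MSOA” and two invented points, one inside and one outside. It is only an illustration of st_join’s behaviour, not a check on the real data.

```r
library(sf)

# One square polygon standing in for an MSOA (coordinates are invented)
square <- st_polygon(list(rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0))))
toy_msoa <- st_sf(MSOA11CD = "E02000001", geometry = st_sfc(square, crs = 27700))

# Two points: the first inside the square, the second outside it
toy_crimes <- st_as_sf(
  data.frame(Crime.type = c("Burglary", "Robbery"), x = c(5, 20), y = c(5, 20)),
  coords = c("x", "y"), crs = 27700
)

# Default left join keeps the unmatched point, with an NA code
joined <- st_join(toy_crimes, toy_msoa["MSOA11CD"])

# left = FALSE keeps only the points that matched a polygon
matched_only <- st_join(toy_crimes, toy_msoa["MSOA11CD"], left = FALSE)

print(joined$MSOA11CD)
```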
We now need to add our monthly date to this dataframe
#select the first unique value of months in the original dataframe
month <- unique(test_df["Month"])[1,1]
monthly_df["Month"] <- month
monthly_df
Now, let’s bring all our previous work together into a function (and fix my awkward prior msoa/lsoa typo)
#quick initial function to generate our MSOA border spatial frame, to avoid it sitting in the initial frame and gobbling loads of memory.
generate_msoa_borders <- function(file){
msoa_borders <- st_read(file, crs=27700)
return(msoa_borders)
}
make_month_pivot <- function(file){
#define our CRS
latlong = "+init=epsg:4326"
ukgrid = "+init=epsg:27700"
#read our crime from the file
test_df <- read.csv(file)
#select only our target crime types
subset_df <- filter(test_df, Crime.type=="Burglary" | Crime.type=="Robbery")
#remove any rows without a long/lat coordinate
clean_df <- subset_df[!rowSums(is.na(subset_df["Longitude"])), ]
#generate a spatial df
subset_spatial <- st_as_sf(clean_df, coords = c("Longitude", "Latitude"),
crs = 4326, remove = FALSE)
#reproject to uk grid coords
subset_osgb <- st_transform(subset_spatial, ukgrid)
#spatially join to assign to an MSOA (relies on the global msoa_borders object, created with generate_msoa_borders)
crime_with_msoa <- st_join(subset_osgb, msoa_borders["MSOA11CD"])
#summarise by count of MSOA
msoa_list<- crime_with_msoa %>%
group_by(MSOA11CD, Crime.type) %>%
summarize(count_by_msoa = n())
#return to a non-geographic msoa
msoa_pivot_tibble <- as_tibble(msoa_list)
msoa_pivot_tibble <- msoa_pivot_tibble[0:3]
#creating a df with all msoa names, for robbery and burglary
msoa_zero_df_robbery <- unique(as_tibble(msoa_borders)["MSOA11CD"])
msoa_zero_df_burglary <- unique(as_tibble(msoa_borders)["MSOA11CD"])
#adding our crime type column
msoa_zero_df_burglary["Crime.type"] = "Burglary"
msoa_zero_df_robbery["Crime.type"] = "Robbery"
#Creating a "count" column identical to our pivot, and filling it with 0
msoa_zero_df_burglary["count_by_msoa"] = as.numeric(0)
msoa_zero_df_robbery["count_by_msoa"] = as.numeric(0)
duplicate_concat <- rbind(msoa_zero_df_robbery, msoa_zero_df_burglary)
df_with_dups <- rbind(msoa_pivot_tibble, duplicate_concat)
#creating a filter for duplicates columns, which should ignore the first instance
dup_filters <- duplicated(df_with_dups[0:2])
monthly_df <- filter(df_with_dups, !dup_filters)
#re-add our month column
month <- unique(test_df["Month"])[1,1]
monthly_df["Month"] <- month
return(monthly_df)
}
We’ve now got our tooling for a data pipeline ready to go! We can now run this on every single month of data, and aggregate into a historical combined data-set.
Data.Police.UK comes as a bunch of nested-subdirectories…in hindsight, I probably should have looked at their API, but for now let’s power ahead and figure out how to extract a list of all the CSV files in our folder and the various sub-directories.
list.files(path = "crimes")
[1] "2018-01" "2018-02" "2018-03" "2018-04" "2018-05" "2018-06" "2018-07" "2018-08" "2018-09" "2018-10" "2018-11" "2018-12" "2019-01" "2019-02" "2019-03" "2019-04"
[17] "2019-05" "2019-06" "2019-07" "2019-08" "2019-09" "2019-10" "2019-11" "2019-12" "2020-01" "2020-02" "2020-03" "2020-04" "2020-05" "2020-06" "2020-07" "2020-08"
[33] "2020-09" "2020-10" "2020-11" "2020-12"
As we suspected, the nested directories cause an issue - guess we’re going to have to learn about loops in R! Let’s iterate over our list of subfolders, and re-apply the function to each.
subfolders <- list.files(path = "crimes")
file_list <- list()
for (folder in subfolders){
folder_subdir <- "crimes/"
#concatenate to get our total subfolder directory - hacky but will work here.
sub_path <- paste(folder_subdir, folder, sep="")
list.files(sub_path)
file_list <- list(file_list, paste(sub_path,"/", list.files(sub_path), sep=""))
}
The bad news is, this totally didn’t work: each pass wraps the previous list inside a new one rather than appending to it. The good news is, it led me to the far cleaner, “recursive” version of the list.files function.
list.files(path = "crimes", recursive=T)
[1] "2018-01/2018-01-metropolitan-street.csv" "2018-02/2018-02-metropolitan-street.csv" "2018-03/2018-03-metropolitan-street.csv"
[4] "2018-04/2018-04-metropolitan-street.csv" "2018-05/2018-05-metropolitan-street.csv" "2018-06/2018-06-metropolitan-street.csv"
[7] "2018-07/2018-07-metropolitan-street.csv" "2018-08/2018-08-metropolitan-street.csv" "2018-09/2018-09-metropolitan-street.csv"
[10] "2018-10/2018-10-metropolitan-street.csv" "2018-11/2018-11-metropolitan-street.csv" "2018-12/2018-12-metropolitan-street.csv"
[13] "2019-01/2019-01-metropolitan-street.csv" "2019-02/2019-02-metropolitan-street.csv" "2019-03/2019-03-metropolitan-street.csv"
[16] "2019-04/2019-04-metropolitan-street.csv" "2019-05/2019-05-metropolitan-street.csv" "2019-06/2019-06-metropolitan-street.csv"
[19] "2019-07/2019-07-metropolitan-street.csv" "2019-08/2019-08-metropolitan-street.csv" "2019-09/2019-09-metropolitan-street.csv"
[22] "2019-10/2019-10-metropolitan-street.csv" "2019-11/2019-11-metropolitan-street.csv" "2019-12/2019-12-metropolitan-street.csv"
[25] "2020-01/2020-01-metropolitan-street.csv" "2020-02/2020-02-metropolitan-street.csv" "2020-03/2020-03-metropolitan-street.csv"
[28] "2020-04/2020-04-metropolitan-street.csv" "2020-05/2020-05-metropolitan-street.csv" "2020-06/2020-06-metropolitan-street.csv"
[31] "2020-07/2020-07-metropolitan-street.csv" "2020-08/2020-08-metropolitan-street.csv" "2020-09/2020-09-metropolitan-street.csv"
[34] "2020-10/2020-10-metropolitan-street.csv" "2020-11/2020-11-metropolitan-street.csv" "2020-12/2020-12-metropolitan-street.csv"
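The paths above are relative to the “crimes” folder, so they still need the folder name pasted on the front. list.files also takes full.names and pattern arguments, which return ready-to-use paths and skip anything that isn’t a CSV. The sketch below builds a throwaway folder tree in a temporary directory (the folder and file names are made up to mimic the real layout).

```r
# Build a throwaway folder tree that mimics crimes/2018-01/... (names are made up)
root <- file.path(tempdir(), "crimes_demo")
for (month in c("2018-01", "2018-02")) {
  dir.create(file.path(root, month), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, month, paste0(month, "-metropolitan-street.csv")))
}

# Recursive listing that returns complete paths, restricted to .csv files
csv_paths <- list.files(root, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)

print(basename(csv_paths))
```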
First, let’s create an empty dataframe we can concatenate all the rest to.
empty_df <- tibble(
MSOA11CD = "",
Crime.type= "",
count_by_msoa= "",
Month= ""
)
empty_df
Annoyingly, I don’t seem to be able to cr
msoa_borders <- generate_msoa_borders("msoa_borders/MSOA_2011_London_gen_MHW.tab")
Reading layer `MSOA_2011_London_gen_MHW' from data source `C:\Users\andre\Dropbox\Data Projects\Covid_crime_shift\msoa_borders\MSOA_2011_London_gen_MHW.tab' using driver `MapInfo File'
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
msoa_borders
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
First 10 features:
MSOA11CD MSOA11NM LAD11CD LAD11NM RGN11CD RGN11NM UsualRes HholdRes ComEstRes PopDen Hholds AvHholdSz
1 E02000001 City of London 001 E09000001 City of London E12000007 London 7375 7187 188 25.5 4385 1.6
2 E02000002 Barking and Dagenham 001 E09000002 Barking and Dagenham E12000007 London 6775 6724 51 31.3 2713 2.5
3 E02000003 Barking and Dagenham 002 E09000002 Barking and Dagenham E12000007 London 10045 10033 12 46.9 3834 2.6
4 E02000004 Barking and Dagenham 003 E09000002 Barking and Dagenham E12000007 London 6182 5937 245 24.8 2318 2.6
5 E02000005 Barking and Dagenham 004 E09000002 Barking and Dagenham E12000007 London 8562 8562 0 72.1 3183 2.7
6 E02000007 Barking and Dagenham 006 E09000002 Barking and Dagenham E12000007 London 8791 8672 119 50.6 3441 2.5
7 E02000008 Barking and Dagenham 007 E09000002 Barking and Dagenham E12000007 London 11569 11564 5 81.5 4591 2.5
8 E02000009 Barking and Dagenham 008 E09000002 Barking and Dagenham E12000007 London 8395 8376 19 87.4 3212 2.6
9 E02000010 Barking and Dagenham 009 E09000002 Barking and Dagenham E12000007 London 8615 8615 0 76.8 3292 2.6
10 E02000011 Barking and Dagenham 010 E09000002 Barking and Dagenham E12000007 London 6187 6086 101 38.8 2289 2.7
geometry
1 MULTIPOLYGON (((532135.1 18...
2 MULTIPOLYGON (((548881.6 19...
3 MULTIPOLYGON (((549102.4 18...
4 MULTIPOLYGON (((551550 1873...
5 MULTIPOLYGON (((549099.6 18...
6 MULTIPOLYGON (((549819.9 18...
7 MULTIPOLYGON (((548171.4 18...
8 MULTIPOLYGON (((546855 1863...
9 MULTIPOLYGON (((549618.8 18...
10 MULTIPOLYGON (((550244.1 18...
Our MSOA border helper function seems to work. Now, time to do the heavy lifting!
subfiles <- list.files(path = "crimes", recursive=T)
for (file in subfiles){
folder_subdir <- "crimes/"
#concatenate to get our total subfolder directory - hacky but will work here.
sub_path <- paste(folder_subdir, file, sep="")
monthly_df <- make_month_pivot(sub_path)
empty_df <- rbind(empty_df, monthly_df)
}
`summarise()` has grouped output by 'MSOA11CD'. You can override using the `.groups` argument.
That seems to have worked! The processing time was longer than I expected (which is probably something to do with how R stores memory) - let’s look at our previously empty dataframe.
empty_df
unique(empty_df["Month"])
So we now have a combined dataframe of just under 71,000 rows, covering the 36 months between January 2018 and December 2020, with a monthly count of robberies and burglaries for each London MSOA. I’d call that a win!
This has been somewhat more painful than I expected, so before going any further, let’s figure out how to save this file. Given it’s all strings and integers, a simple CSV should do for now.
write.csv(empty_df,"msoa_crime_matrix.csv")
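A note on the round trip: by default `write.csv()` also writes the row names as an unnamed first column, which comes back as an extra `X` column when the file is read again. A minimal sketch on a made-up three-row data frame (the values and the temporary files are illustrative assumptions, not the real crime matrix) shows the difference `row.names = FALSE` makes:

```r
# Toy stand-in for the crime matrix; the values are invented for illustration.
toy_df <- data.frame(
  MSOA11CD = c("E02000001", "E02000001", "E02000002"),
  Month = c("2018-01", "2018-02", "2018-01"),
  count_by_msoa = c(3L, 5L, 2L),
  stringsAsFactors = FALSE
)

path_default <- tempfile(fileext = ".csv")
path_clean <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an extra leading column.
write.csv(toy_df, path_default)
# With row.names = FALSE only the real columns are written.
write.csv(toy_df, path_clean, row.names = FALSE)

names(read.csv(path_default))  # includes a leading "X" column
names(read.csv(path_clean))    # same columns as toy_df
```

That extra column is presumably why the first column is dropped when the file is read back in below.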
We can now move on to the fun bit - predicting the crime trend we’d expect, and then looking at how much it diverges when we hit the “pandemic disruption” period.
I’ll either be using auto-arima or Facebook’s prophet algorithm, both of which produce relatively accurate forecasts with little necessary tuning. I’m hoping 2018 through early 2020 should be sufficient to establish trends and seasonality. We can then use our error rate from March 2020 onwards as a measure of the “pandemic effect”.
As a test, let’s start by predicting one MSOA: the first in our df, “E02000001”
empty_df <- read.csv("msoa_crime_matrix.csv")
empty_df <- empty_df[2:70848,2:5]
empty_df
single_msoa_df <- filter(empty_df, MSOA11CD == "E02000001" & Crime.type=="Burglary")
single_msoa_df
As we’d expect, 36 rows for 36 months. Let’s convert those month strings to dates, and start making predictions.
single_msoa_df$DateString <- paste0(single_msoa_df$Month, "-01")
single_msoa_df
Converting these to dates was harder than I’d anticipated, but the Tidyverse ecosystem does have some nifty tools!
library(lubridate)
Attaching package: ‘lubridate’
The following objects are masked from ‘package:rgeos’:
intersect, setdiff, union
The following objects are masked from ‘package:raster’:
intersect, union
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
single_msoa_df$DateClean <- ymd(single_msoa_df$DateString)
single_msoa_df
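One detail worth knowing when building date strings: `paste()` separates its arguments with a space by default, whereas `paste0()` does not, which matters for a strict parser such as `as.Date()`. A minimal base-R sketch on made-up month labels (the values are illustrative assumptions):

```r
# Month labels in the "YYYY-MM" form used by the crime data (values invented).
months <- c("2018-01", "2019-12", "2020-04")

# paste() inserts a space before "-01"; paste0() does not.
with_space <- paste(months, "-01")
no_space <- paste0(months, "-01")

# as.Date() with an explicit format parses the clean strings.
dates <- as.Date(no_space, format = "%Y-%m-%d")

print(with_space)
print(dates)
```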
Let’s now build our “pre-pandemic” training set, up to February 2020, and use prophet to make some predictions.
training_set <- filter(single_msoa_df, DateClean < "2020-03-01")
training_set
training_df <- tibble(
ds=training_set$DateClean,
y=training_set$count_by_msoa
)
training_df
We can now instantiate our prophet model, and start making predictions.
library(prophet)
Loading required package: Rcpp
Loading required package: rlang
Attaching package: ‘rlang’
The following object is masked from ‘package:Metrics’:
ll
The following objects are masked from ‘package:purrr’:
%@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl, flatten_raw, invoke, list_along, modify, prepend, splice
m <- prophet(training_df)
Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
n.changepoints greater than number of observations. Using 19
We’ll predict for a period of six months and then compare to what actually happened.
future <- make_future_dataframe(m, periods = 6, freq = 'month')
tail(future)
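To see which months a six-period horizon covers, here is a base-R sketch that mimics only the future rows, assuming the last training month is February 2020 (as the `DateClean < "2020-03-01"` filter implies). It does not call prophet; `seq()` stands in for `make_future_dataframe()`:

```r
# Last month in the training data (assumption based on the filter above).
last_training_month <- as.Date("2020-02-01")

# Six further monthly steps; drop the first element, which is the start date itself.
future_months <- seq(last_training_month, by = "month", length.out = 7)[-1]

print(format(future_months, "%Y-%m"))
# April and May 2020 both fall inside the forecast horizon.
c("2020-04", "2020-05") %in% format(future_months, "%Y-%m")
```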
Let’s get forecasting
forecast <- predict(m, future)
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
# R
plot(m, forecast)
These predictions obviously look a little silly, but the yearly trend (which is what we really wanted to get out of this) doesn’t look mad to me. I’m hoping that using all our MSOA in aggregate, we’ll get meaningful data. Firstly, we need to get our error rate.
head(forecast)
forecast$Month <- month(forecast$ds)
forecast$Year <- year(forecast$ds)
forecast
Next, we group the predictions by month. We’ll look at April and May, which were the “peak” of the COVID effect in London.
forecast
this_year <- filter(forecast, Year > 2019)
peak_pandemic <- filter(this_year, Month== 4 | Month== 5 )
peak_pandemic
predictionPivot <- peak_pandemic %>%
group_by(Month) %>%
summarize(predicted_burglary = mean(yhat))
predictionPivot
Now, let’s compare that to our ACTUAL data, and get an error rate.
single_msoa_df$MonthNum <- month(single_msoa_df$DateClean)
single_msoa_df$YearNum <- year(single_msoa_df$DateClean)
this_year_actual <- filter(single_msoa_df, YearNum > 2019)
peak_pandemic_actual <- filter(this_year_actual, MonthNum== 4 | MonthNum== 5 )
peak_pandemic_actual
So our model anticipated around 18 burglaries during this period, while in reality there was 1. This gap is our measure of the “covid error” for the MSOA. Let’s express it as a percentage error as well as an absolute error, and then repeat the process for all of London.
actual_burglary <- sum(peak_pandemic_actual$count_by_msoa)
pred_burglary <- sum(predictionPivot$predicted_burglary)
error <- actual_burglary - pred_burglary
percentage_error <- error / pred_burglary
print("Burglary Count")
print(actual_burglary)
print("Predicted")
print(pred_burglary)
print("Actual Error")
print(error)
print("Percentage Error")
print(percentage_error)
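The arithmetic is small enough to wrap in a helper. The sketch below uses the rounded figures quoted above (about 18 predicted, 1 recorded) rather than the model's exact output, and `covid_error()` is a hypothetical name introduced here purely for illustration:

```r
# Hypothetical helper: absolute and proportional error of a forecast.
covid_error <- function(actual, predicted) {
  error <- actual - predicted
  list(
    error = error,
    # Expressed as a proportion of the prediction; multiply by 100 for a percentage.
    proportion = error / predicted
  )
}

res <- covid_error(actual = 1, predicted = 18)
print(res$error)       # -17
print(res$proportion)  # roughly -0.94
```

A negative value means fewer crimes were recorded than forecast. Note that the ratio becomes unstable when the prediction is close to zero.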
Now, let’s automate. Similar to our previous process, we’ll create a dataframe for every MSOA, and both types, then run a for loop repeating our process across London.
msoa_error_tibble <- tibble(
MSOA11CD = "",
burglaryActual= "",
burglaryPredicted= "",
burglaryError= "",
burglaryPercentError="",
robberyActual= "",
robberyPredicted= "",
robberyError= "",
robberyPercentError=""
)
msoa_error_tibble
calculate_error <- function(msoaName){
#select only burglary and our msoa
single_msoa_df <- filter(empty_df, MSOA11CD == msoaName & Crime.type=="Burglary")
#build a clean date column from the month string
single_msoa_df$DateString <- paste0(single_msoa_df$Month, "-01")
single_msoa_df$DateClean <- ymd(single_msoa_df$DateString)
#generate training set up until March
training_set <- filter(single_msoa_df, DateClean < "2020-03-01")
#prepare for Prophet
training_df <- tibble(
ds=training_set$DateClean,
y=training_set$count_by_msoa)
#start and predict prophet for 6 months
m <- prophet(training_df)
future <- make_future_dataframe(m, periods = 6, freq = 'month')
forecast <- predict(m, future)
forecast$Month <- month(forecast$ds)
forecast$Year <- year(forecast$ds)
#aggregate forecasts and actual crime
this_year <- filter(forecast, Year > 2019)
peak_pandemic <- filter(this_year, Month== 4 | Month== 5 )
predictionPivot <- peak_pandemic %>%
group_by(Month) %>%
summarize(predicted_burglary = mean(yhat))
single_msoa_df$MonthNum <- month(single_msoa_df$DateClean)
single_msoa_df$YearNum <- year(single_msoa_df$DateClean)
#generate error rates
this_year_actual <- filter(single_msoa_df, YearNum > 2019)
peak_pandemic_actual <- filter(this_year_actual, MonthNum== 4 | MonthNum== 5 )
actual_burglary <- sum(peak_pandemic_actual$count_by_msoa)
pred_burglary <- sum(predictionPivot$predicted_burglary)
error_burg <- actual_burglary - pred_burglary
percentage_error_burg <- error_burg / pred_burglary
#now repeat the same steps for robbery
single_msoa_df <- filter(empty_df, MSOA11CD == msoaName & Crime.type=="Robbery")
single_msoa_df$DateString <- paste0(single_msoa_df$Month, "-01")
single_msoa_df$DateClean <- ymd(single_msoa_df$DateString)
#training set up until March, prepared for Prophet
training_set <- filter(single_msoa_df, DateClean < "2020-03-01")
training_df <- tibble(
ds=training_set$DateClean,
y=training_set$count_by_msoa)
#fit prophet and predict 6 months ahead
m <- prophet(training_df)
future <- make_future_dataframe(m, periods = 6, freq = 'month')
forecast <- predict(m, future)
forecast$Month <- month(forecast$ds)
forecast$Year <- year(forecast$ds)
#aggregate forecasts and actual crime for April and May 2020
this_year <- filter(forecast, Year > 2019)
peak_pandemic <- filter(this_year, Month== 4 | Month== 5 )
predictionPivot <- peak_pandemic %>%
group_by(Month) %>%
summarize(predicted_robbery = mean(yhat))
single_msoa_df$MonthNum <- month(single_msoa_df$DateClean)
single_msoa_df$YearNum <- year(single_msoa_df$DateClean)
this_year_actual <- filter(single_msoa_df, YearNum > 2019)
peak_pandemic_actual <- filter(this_year_actual, MonthNum== 4 | MonthNum== 5 )
#generate error rates
actual_robbery <- sum(peak_pandemic_actual$count_by_msoa)
pred_robbery <- sum(predictionPivot$predicted_robbery)
error_rob <- actual_robbery - pred_robbery
percentage_error_rob <- error_rob / pred_robbery
#create our output dataframe and return it
msoa_error_tibble <- tibble(
MSOA11CD = msoaName,
burglaryActual= actual_burglary,
burglaryPredicted= pred_burglary,
burglaryError= error_burg,
burglaryPercentError = percentage_error_burg,
robberyActual= actual_robbery,
robberyPredicted= pred_robbery,
robberyError= error_rob,
robberyPercentError=percentage_error_rob
)
return(msoa_error_tibble)
}
Messy and hacky, but theoretically functional! Now for the long bit - let’s loop over all our MSOAs, and get our aggregate error dataframe.
for (msoa in unique(empty_df$MSOA11CD)){
iterated_msoa_df <- calculate_error(msoa)
msoa_error_tibble <- rbind(msoa_error_tibble, iterated_msoa_df)
}
msoa_error_tibble
write_csv(msoa_error_tibble, "msoa_error_table.csv")
msoa_error_tibble
msoa_error_tibble[,2:9] <- lapply(msoa_error_tibble[,2:9], as.numeric)
msoa_error_tibble <- msoa_error_tibble[2:980, ]
msoa_error_tibble
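An alternative that avoids the blank placeholder row (and therefore the `as.numeric` conversion and row-dropping above) is to collect the per-MSOA results in a list and bind them once at the end. This is a sketch only: `fake_calculate_error()` is a made-up stand-in that returns fixed numbers, not the prophet-based function.

```r
# Made-up stand-in for calculate_error(): returns a one-row data frame per MSOA.
fake_calculate_error <- function(msoa_name) {
  data.frame(
    MSOA11CD = msoa_name,
    burglaryActual = 1,
    burglaryPredicted = 2.5,
    stringsAsFactors = FALSE
  )
}

msoa_codes <- c("E02000001", "E02000002", "E02000003")

# Build one result per code, then bind all rows in a single call.
results <- lapply(msoa_codes, fake_calculate_error)
error_table <- do.call(rbind, results)

str(error_table)
```

Because every row comes from the function itself, the numeric columns stay numeric and there is no dummy first row to remove.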
Let’s also calculate the “Relative Percentage Difference” (RPD) of our estimates.
msoa_error_tibble$RPDBurglary <- 2*((msoa_error_tibble$burglaryPredicted - msoa_error_tibble$burglaryActual)/(abs(msoa_error_tibble$burglaryPredicted) + abs(msoa_error_tibble$burglaryActual)))
msoa_error_tibble$RPDRobbery <- 2*((msoa_error_tibble$robberyPredicted - msoa_error_tibble$robberyActual)/(abs(msoa_error_tibble$robberyPredicted) + abs(msoa_error_tibble$robberyActual)))
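To get a feel for how this measure behaves, here is a small sketch with a hypothetical `rpd()` helper that mirrors the formula above. The input values are taken from the first rows of the error table printed further down; the helper name is an assumption introduced for illustration.

```r
# Hypothetical helper mirroring the RPD formula used in the notebook.
rpd <- function(predicted, actual) {
  2 * (predicted - actual) / (abs(predicted) + abs(actual))
}

# Same sign: the value lies strictly between -2 and 2.
rpd(predicted = 7.62501855, actual = 1)
# Negative prediction against a positive count: the value saturates at -2.
rpd(predicted = -9.23326714, actual = 8)
# Zero recorded crimes against a positive prediction: saturates at +2.
rpd(predicted = 1.129576, actual = 0)
```

The measure is bounded between -2 and 2, and it hits a bound whenever the prediction and the actual count have opposite signs or one of them is zero. Since the forecasts can go negative, a fair number of MSOAs are likely to sit at exactly -2 or 2, which is worth bearing in mind when reading the maps and correlations.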
Now, let’s link all these back to our original geographic dataframe.
lsoa_borders <- st_read("msoa_borders/MSOA_2011_London_gen_MHW.tab", crs=27700)
Reading layer `MSOA_2011_London_gen_MHW' from data source `C:\Users\andre\Dropbox\Data Projects\Covid_crime_shift\msoa_borders\MSOA_2011_London_gen_MHW.tab' using driver `MapInfo File'
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
lsoa_borders
Simple feature collection with 983 features and 12 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
First 10 features:
MSOA11CD MSOA11NM LAD11CD LAD11NM RGN11CD RGN11NM UsualRes HholdRes ComEstRes PopDen Hholds AvHholdSz
1 E02000001 City of London 001 E09000001 City of London E12000007 London 7375 7187 188 25.5 4385 1.6
2 E02000002 Barking and Dagenham 001 E09000002 Barking and Dagenham E12000007 London 6775 6724 51 31.3 2713 2.5
3 E02000003 Barking and Dagenham 002 E09000002 Barking and Dagenham E12000007 London 10045 10033 12 46.9 3834 2.6
4 E02000004 Barking and Dagenham 003 E09000002 Barking and Dagenham E12000007 London 6182 5937 245 24.8 2318 2.6
5 E02000005 Barking and Dagenham 004 E09000002 Barking and Dagenham E12000007 London 8562 8562 0 72.1 3183 2.7
6 E02000007 Barking and Dagenham 006 E09000002 Barking and Dagenham E12000007 London 8791 8672 119 50.6 3441 2.5
7 E02000008 Barking and Dagenham 007 E09000002 Barking and Dagenham E12000007 London 11569 11564 5 81.5 4591 2.5
8 E02000009 Barking and Dagenham 008 E09000002 Barking and Dagenham E12000007 London 8395 8376 19 87.4 3212 2.6
9 E02000010 Barking and Dagenham 009 E09000002 Barking and Dagenham E12000007 London 8615 8615 0 76.8 3292 2.6
10 E02000011 Barking and Dagenham 010 E09000002 Barking and Dagenham E12000007 London 6187 6086 101 38.8 2289 2.7
geometry
1 MULTIPOLYGON (((532135.1 18...
2 MULTIPOLYGON (((548881.6 19...
3 MULTIPOLYGON (((549102.4 18...
4 MULTIPOLYGON (((551550 1873...
5 MULTIPOLYGON (((549099.6 18...
6 MULTIPOLYGON (((549819.9 18...
7 MULTIPOLYGON (((548171.4 18...
8 MULTIPOLYGON (((546855 1863...
9 MULTIPOLYGON (((549618.8 18...
10 MULTIPOLYGON (((550244.1 18...
geographic_error_map <- left_join(lsoa_borders, msoa_error_tibble, by = "MSOA11CD")
geographic_error_map
Simple feature collection with 983 features and 22 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
projected CRS: OSGB 1936 / British National Grid
First 10 features:
MSOA11CD MSOA11NM LAD11CD LAD11NM RGN11CD RGN11NM UsualRes HholdRes ComEstRes PopDen Hholds AvHholdSz burglaryActual
1 E02000001 City of London 001 E09000001 City of London E12000007 London 7375 7187 188 25.5 4385 1.6 1
2 E02000002 Barking and Dagenham 001 E09000002 Barking and Dagenham E12000007 London 6775 6724 51 31.3 2713 2.5 8
3 E02000003 Barking and Dagenham 002 E09000002 Barking and Dagenham E12000007 London 10045 10033 12 46.9 3834 2.6 11
4 E02000004 Barking and Dagenham 003 E09000002 Barking and Dagenham E12000007 London 6182 5937 245 24.8 2318 2.6 2
5 E02000005 Barking and Dagenham 004 E09000002 Barking and Dagenham E12000007 London 8562 8562 0 72.1 3183 2.7 4
6 E02000007 Barking and Dagenham 006 E09000002 Barking and Dagenham E12000007 London 8791 8672 119 50.6 3441 2.5 5
7 E02000008 Barking and Dagenham 007 E09000002 Barking and Dagenham E12000007 London 11569 11564 5 81.5 4591 2.5 10
8 E02000009 Barking and Dagenham 008 E09000002 Barking and Dagenham E12000007 London 8395 8376 19 87.4 3212 2.6 16
9 E02000010 Barking and Dagenham 009 E09000002 Barking and Dagenham E12000007 London 8615 8615 0 76.8 3292 2.6 3
10 E02000011 Barking and Dagenham 010 E09000002 Barking and Dagenham E12000007 London 6187 6086 101 38.8 2289 2.7 2
burglaryPredicted burglaryError burglaryPercentError robberyActual robberyPredicted robberyError robberyPercentError RPDBurglary RPDRobbery
1 7.62501855 -6.6250185 -0.8688528 1 -1.976662 2.976662 -1.5059034 1.5362329 -2.0000000
2 -9.23326714 17.2332671 -1.8664322 0 1.129576 -1.129576 -1.0000000 -2.0000000 2.0000000
3 12.34800064 -1.3480006 -0.1091675 10 6.922428 3.077572 0.4445798 0.1154703 -0.3637269
4 -4.71960263 6.7196026 -1.4237645 0 -1.271292 1.271292 -1.0000000 -2.0000000 -2.0000000
5 4.58490183 -0.5849018 -0.1275713 1 9.214027 -8.214027 -0.8914698 0.1362629 1.6083817
6 13.29091514 -8.2909151 -0.6238032 5 1.267089 3.732911 2.9460534 0.9065610 -1.1912744
7 11.22961111 -1.2296111 -0.1094972 1 -5.357023 6.357023 -1.1866708 0.1158393 -2.0000000
8 -1.51347706 17.5134771 -11.5716832 3 -3.718159 6.718159 -1.8068510 -2.0000000 -2.0000000
9 -0.07818478 3.0781848 -39.3706412 0 -5.041581 5.041581 -1.0000000 -2.0000000 -2.0000000
10 -1.54900357 3.5490036 -2.2911526 1 4.929491 -3.929491 -0.7971393 -2.0000000 1.3254058
geometry
1 MULTIPOLYGON (((532135.1 18...
2 MULTIPOLYGON (((548881.6 19...
3 MULTIPOLYGON (((549102.4 18...
4 MULTIPOLYGON (((551550 1873...
5 MULTIPOLYGON (((549099.6 18...
6 MULTIPOLYGON (((549819.9 18...
7 MULTIPOLYGON (((548171.4 18...
8 MULTIPOLYGON (((546855 1863...
9 MULTIPOLYGON (((549618.8 18...
10 MULTIPOLYGON (((550244.1 18...
Let’s map both of these metrics, and see what it looks like.
# map
burg_map <- tm_shape(geographic_error_map) +
tm_fill(col = "RPDBurglary", title = "Burglary Relative Error")
rob_map <- tm_shape(geographic_error_map) +
tm_fill(col = "RPDRobbery", title = "Robbery Relative Error")
tmap_arrange(burg_map, rob_map)
As a final part of this project, I’m going to explore some geographic modelling. Let’s start with linking our current data with the London MOPAC MSOA Atlas, which should provide a whole bunch of useful demographic and economic data. I’ve slightly modified it in Excel to get rid of the weird header structure.
library(readxl)
msoa_atlas <- read_excel("msoa_atlas/msoa-data.xls")
New names:
* `House Prices Sales 2011` -> `House Prices Sales 2011...129`
* `House Prices Sales 2011` -> `House Prices Sales 2011...130`
msoa_atlas
Let’s do one last join to bring all these things together (an attribute join on MSOA11CD rather than a true spatial join).
geographic_msoa_matrix <- left_join(geographic_error_map, msoa_atlas, by = "MSOA11CD")
Let’s provide a tibble version as well for clearer analysis.
msoa_matrix_tbl <- as_tibble(geographic_msoa_matrix)
msoa_matrix_tbl
Let’s look at how correlated our factors are
library(corrr)
Attaching package: ‘corrr’
The following object is masked from ‘package:raster’:
stretch
corr_df <- correlate(msoa_matrix_tbl, quiet = TRUE)
Error in stats::cor(x = x, y = y, use = use, method = method) :
'x' must be numeric
Annoyingly, unlike pandas, R throws an error here (pandas implicitly drops non-numerical columns). Let’s clean it up.
msoa_matrix_numeric <-dplyr::select_if(msoa_matrix_tbl, is.numeric)
msoa_matrix_numeric
corr_df <- correlate(dplyr::select_if(msoa_matrix_tbl, is.numeric), quiet = TRUE)
corr_df
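The same filter-then-correlate step can be sketched in base R on a toy data frame (the column names and values are invented for illustration); `sapply(df, is.numeric)` plays the role of `select_if()` and `cor()` the role of `correlate()`:

```r
# Toy data frame mixing character and numeric columns.
toy <- data.frame(
  MSOA11CD = c("A", "B", "C", "D"),
  rpd = c(-2, 0.5, 1.5, 2),
  deprivation = c(30, 18, 12, 5),
  stringsAsFactors = FALSE
)

# cor() fails on the full data frame because MSOA11CD is not numeric.
failed <- inherits(try(cor(toy), silent = TRUE), "try-error")

# Keep only numeric columns, then correlate.
numeric_only <- toy[, sapply(toy, is.numeric), drop = FALSE]
corr_matrix <- cor(numeric_only)

print(failed)
print(round(corr_matrix, 2))
```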
Let’s now look for correlates of our error rate for burglary and robbery
options(scipen = 999)
dplyr::select(corr_df[order(corr_df$RPDRobbery),] , term, RPDRobbery)
dplyr::select(corr_df[order(corr_df$RPDBurglary),] , term, RPDBurglary)
There are very few decent correlates in the robbery data - everything is a bit of a mess. That’s not the case in the burglary data however: we might be able to do some modeling here. General deprivation indicators stand out very sharply, correlated to ethnicity.
I’d normally start exploring spatial models and weights, but I think that’s a little outside the scope of this first project. Instead, let’s get straight to modeling.
log(msoa_matrix_numeric["Income Deprivation (2010) % living in income deprived households reliant on means tested benefit"])
msoa_burglary_copy <- msoa_matrix_numeric
names(msoa_burglary_copy)[names(msoa_burglary_copy) == "Income Deprivation (2010) % living in income deprived households reliant on means tested benefit"] <- "hhPercentBenefit"
names(msoa_burglary_copy)[names(msoa_burglary_copy) == "Lone Parents (2011 Census) Lone parents not in employment"] <- "UnempLoneParents"
names(msoa_burglary_copy)[names(msoa_burglary_copy) == "Ethnic Group (2011 Census) Black/African/Caribbean/Black British (%)"] <- "blackPercent"
names(msoa_burglary_copy)[names(msoa_burglary_copy) == "Health (2011 Census) Bad health (%)"] <- "percentBadHealth"
names(msoa_burglary_copy)[names(msoa_burglary_copy) == "Tenure (2011) Owned: Owned outright"] <- "HouseOwned"
feature_df <- dplyr::select(msoa_burglary_copy, RPDBurglary, hhPercentBenefit, UnempLoneParents, blackPercent, percentBadHealth, HouseOwned)
feature_df
We have a few NAs. Let’s get rid of them.
feature_df <- drop_na(feature_df, RPDBurglary)
feature_df
colnames(feature_df)
[1] "RPDBurglary" "hhPercentBenefit" "UnempLoneParents" "blackPercent" "percentBadHealth" "HouseOwned"
colSums(is.na(feature_df))
RPDBurglary hhPercentBenefit UnempLoneParents blackPercent percentBadHealth HouseOwned
0 0 0 0 0 0
pairs(feature_df)
We are going to want to perform a log transform. Because our RPD has negative values, we’ll need to add a constant first: in this case 3, so that everything is positive.
feature_df$RPDBurglaryTranform <- feature_df$RPDBurglary + 3
feature_df
Now, let’s produce our log transform columns
for (col in colnames(feature_df)){
new_name <- paste("log_", col, sep = "")
feature_df[new_name] <- log(feature_df[col])
}
NaNs produced
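Why the shift is needed, and where the `NaNs produced` warning comes from, can be seen on a handful of made-up RPD values. The constant 3 is the one chosen above; any constant greater than 2 would do, since RPD cannot go below -2.

```r
# Made-up RPD values spanning the full [-2, 2] range.
rpd_values <- c(-2, -0.5, 0, 1.2, 2)

# log() of a negative number is NaN (with a warning); log(0) is -Inf.
raw_log <- suppressWarnings(log(rpd_values))

# Shifting by 3 makes every value at least 1, so the log is finite and non-negative.
shifted_log <- log(rpd_values + 3)

print(raw_log)
print(shifted_log)
```

The warning in the loop comes from logging the unshifted `RPDBurglary` column, which is why that column is left out of the selection that follows.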
feat_transform_df <- feature_df[,9:14]
feat_transform_df
pairs(feat_transform_df)
correlate(feat_transform_df)
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
mod_burglary <- lm(log_RPDBurglaryTranform ~ log_blackPercent + log_hhPercentBenefit , data = feat_transform_df)
summary(mod_burglary)
Call:
lm(formula = log_RPDBurglaryTranform ~ log_blackPercent + log_hhPercentBenefit,
data = feat_transform_df)
Residuals:
Min 1Q Median 3Q Max
-1.2291 -0.0747 0.1930 0.3203 0.5826
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.26338 0.08107 15.584 <2e-16 ***
log_blackPercent -0.04064 0.03000 -1.355 0.176
log_hhPercentBenefit -0.03171 0.04473 -0.709 0.479
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4874 on 976 degrees of freedom
Multiple R-squared: 0.0121, Adjusted R-squared: 0.01008
F-statistic: 5.978 on 2 and 976 DF, p-value: 0.002627
mod_burglary <- lm(log_RPDBurglaryTranform ~ log_blackPercent , data = feat_transform_df)
summary(mod_burglary)
Call:
lm(formula = log_RPDBurglaryTranform ~ log_blackPercent, data = feat_transform_df)
Residuals:
Min 1Q Median 3Q Max
-1.22673 -0.07921 0.19246 0.32606 0.56948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.21385 0.04113 29.515 < 2e-16 ***
log_blackPercent -0.05809 0.01716 -3.385 0.00074 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4872 on 977 degrees of freedom
Multiple R-squared: 0.01159, Adjusted R-squared: 0.01058
F-statistic: 11.46 on 1 and 977 DF, p-value: 0.0007397
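So, we have significance, although with an R-squared of about 0.01 the model explains very little of the variance. It looks like the main drivers are general deprivation indicators, but because they are correlated with each other, we’re struggling to get any more detailed. Let’s use Random Forests to try and dig into this a bit further.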
msoa_matrix_numeric
Once again, let’s get rid of NAs. This time, we’ll first drop rows with a missing RPD, and then drop any columns that still contain NAs.
rf_msoa_matrix <- drop_na(msoa_matrix_numeric, RPDBurglary)
rf_msoa_matrix
clean_rf_matrix <- rf_msoa_matrix[ , colSums(is.na(rf_msoa_matrix)) == 0]
clean_rf_matrix
Right, we’ve now got our data good to go, with no NAs. https://towardsdatascience.com/random-forest-in-r-f66adf80ec9
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:ggplot2’:
margin
require(caTools)
Loading required package: caTools
library(caret)
clean_rf_matrix
drop<- c("burglaryActual","burglaryError","burglaryPercentError","burglaryPredicted","robberyActual","robberyPredicted","robberyError","robberyPercentError","RPDRobbery")
data<- clean_rf_matrix[,!(names(clean_rf_matrix) %in% drop)]
data
sample = sample.split(data$RPDBurglary, SplitRatio = 0.75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
dim(train)
[1] 734 208
dim(test)
[1] 245 208
rf <- randomForest(
RPDBurglary ~ .,
data=train
)
Error in eval(predvars, data, env) :
object 'Age Structure (2011 Census) All Ages' not found
Annoyingly, because my column names contain white space, the formula interface can’t find them.
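What `make.names()` does to such headers can be checked directly in base R. The two column names below are taken from headers quoted elsewhere in this notebook; nothing else is assumed.

```r
# Two column names taken from the MSOA Atlas headers quoted in this notebook.
raw_names <- c(
  "Age Structure (2011 Census) All Ages",
  "Health (2011 Census) Bad health (%)"
)

# make.names() replaces every character that is not valid in an R name with a dot.
clean_names <- make.names(raw_names, unique = TRUE)
print(clean_names)
```

Spaces, brackets and the percent sign all become dots, which explains the long runs of dots in the variable names reported later on.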
names(clean_rf_matrix)<-make.names(names(clean_rf_matrix),unique = TRUE)
drop<- c("burglaryActual","burglaryError","burglaryPercentError","burglaryPredicted","robberyActual","robberyPredicted","robberyError","robberyPercentError","RPDRobbery")
data<- clean_rf_matrix[,!(names(clean_rf_matrix) %in% drop)]
names(data)<- make.names(names(data),unique = TRUE)
data
sample = sample.split(data$RPDBurglary, SplitRatio = 0.75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
rf <- randomForest(
RPDBurglary ~ .,
data=train,
importance=TRUE
)
summary(rf)
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 734 -none- numeric
mse 500 -none- numeric
rsq 500 -none- numeric
oob.times 734 -none- numeric
importance 414 -none- numeric
importanceSD 207 -none- numeric
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 11 -none- list
coefs 0 -none- NULL
y 734 -none- numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
library(DALEX)
Welcome to DALEX (version: 2.1.1).
Find examples and detailed introduction at: http://ema.drwhy.ai/
Additional features will be available after installation of: ggpubr.
Use 'install_dependencies()' to get all suggested dependencies
Attaching package: ‘DALEX’
The following object is masked from ‘package:dplyr’:
explain
rf_explainer <- explain(rf, data=train, y= train$RPDBurglary)
Preparation of a new explainer is initiated
-> model label : randomForest ( default )
-> data : 734 rows 208 cols
-> data : tibble converted into a data.frame
-> target variable : 734 values
-> predict function : yhat.randomForest will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package randomForest , ver. 4.6.14 , task regression ( default )
-> predicted values : numerical, min = -1.403661 , mean = 0.2506974 , max = 1.575263
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -1.245388 , mean = 0.01763829 , max = 1.024018
A new explainer has been created!
rf_perf <- model_performance(rf_explainer)
rf_perf
Measures for: regression
mse : 0.2092694
rmse : 0.4574597
r2 : 0.8385729
mad : 0.300109
Residuals:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
-1.24538780 -0.80900554 -0.31391093 -0.09581602 0.05259960 0.14255520 0.22973983 0.29717804 0.37484826 0.47989674 1.02401822
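These figures are computed on the training rows, since the explainer was built with `data = train`. As a contrast, here is a minimal sketch of scoring a model on held-out rows instead. It uses `lm()` on the built-in `mtcars` data purely as a stand-in (an assumption for illustration, not the random forest above), and `rmse()` and `r_squared()` are hypothetical helpers:

```r
# Hypothetical helpers for two of the measures reported above.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Deterministic split of a built-in dataset: every fourth row is held out.
is_test <- seq_len(nrow(mtcars)) %% 4 == 0
train_rows <- mtcars[!is_test, ]
test_rows <- mtcars[is_test, ]

# Stand-in model; any model with a predict() method would slot in here.
fit <- lm(mpg ~ wt + hp, data = train_rows)

in_sample <- rmse(train_rows$mpg, predict(fit, train_rows))
held_out <- rmse(test_rows$mpg, predict(fit, newdata = test_rows))

print(c(in_sample = in_sample, held_out = held_out))
print(r_squared(test_rows$mpg, predict(fit, newdata = test_rows)))
```

Applied to this notebook, the equivalent would be to score the forest on the `test` split, which was created earlier but is not otherwise used in this section.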
help(variable_importance)
var_importance_ranger_after_vi <- variable_importance(
rf_explainer,
loss_function = loss_root_mean_square,
B = 10,
type = "raw")
plot(var_importance_ranger_after_vi)
var_importance_ranger_after_vi
variable mean_dropout_loss label
1 _full_model_ 0.4574597 randomForest
2 RPDBurglary 0.4574597 randomForest
3 Road.Casualties.2012.Fatal 0.4578162 randomForest
4 Road.Casualties.2011.Fatal 0.4578242 randomForest
5 Road.Casualties.2010.Fatal 0.4590182 randomForest
6 UsualRes 0.4593372 randomForest
7 Mid.year.Estimate.totals.All.Ages.2011 0.4594384 randomForest
8 Mid.year.Estimate.totals.All.Ages.2009 0.4596475 randomForest
9 HholdRes 0.4598607 randomForest
10 AvHholdSz 0.4602045 randomForest
11 Age.Structure..2011.Census..All.Ages 0.4602387 randomForest
12 Mid.year.Estimate.totals.All.Ages.2012 0.4602477 randomForest
13 Dwelling.type..2011..Household.spaces.with.at.least.one.usual.resident 0.4603354 randomForest
14 Health..2011.Census..Day.to.day.activities.not.limited 0.4603429 randomForest
15 Mid.year.Estimate.totals.All.Ages.2010 0.4605394 randomForest
16 Religion..2011..Other.religion.... 0.4605825 randomForest
17 Car.or.van.availability..2011.Census..4.or.more.cars.or.vans.in.household.... 0.4607606 randomForest
18 Mid.year.Estimate.totals.All.Ages.2006 0.4609268 randomForest
19 Religion..2011..Buddhist.... 0.4609623 randomForest
20 House.Prices.Median.House.Price.....2007 0.4610875 randomForest
21 Mid.year.Estimate.totals.All.Ages.2008 0.4614278 randomForest
22 Households..2011..All.Households 0.4615022 randomForest
23 Road.Casualties.2010.Serious 0.4616143 randomForest
24 House.Prices.Median.House.Price.....2011 0.4616580 randomForest
25 Hholds 0.4616741 randomForest
26 Road.Casualties.2011.Serious 0.4619903 randomForest
27 Household.Language..2011..At.least.one.person.aged.16.and.over.in.household.has.English.as.a.main.language 0.4620275 randomForest
28 Country.of.Birth..2011..Not.United.Kingdom.... 0.4620599 randomForest
29 Car.or.van.availability..2011.Census..2.cars.or.vans.in.household.... 0.4621042 randomForest
30 Car.or.van.availability..2011.Census..2.cars.or.vans.in.household 0.4621944 randomForest
31 Health..2011.Census..Day.to.day.activities.limited.a.lot 0.4623724 randomForest
32 Age.Structure..2011.Census..Working.age 0.4624088 randomForest
33 Health..2011.Census..Fair.health 0.4624461 randomForest
34 Mid.year.Estimate.totals.All.Ages.2007 0.4624820 randomForest
35 Qualifications..2011.Census..Highest.level.of.qualification..Level.2.qualifications 0.4625426 randomForest
36 Car.or.van.availability..2011.Census..Sum.of.all.cars.or.vans.in.the.area 0.4625642 randomForest
37 Household.Language..2011....of.people.aged.16.and.over.in.household.have.English.as.a.main.language 0.4625868 randomForest
38 Household.Language..2011....of.households.where.no.people.in.household.have.English.as.a.main.language 0.4627663 randomForest
39 Religion..2011..Sikh.... 0.4627723 randomForest
40 Car.or.van.availability..2011.Census..3.cars.or.vans.in.household.... 0.4627908 randomForest
41 Qualifications..2011.Census..Highest.level.of.qualification..Level.1.qualifications 0.4629197 randomForest
42 Road.Casualties.2011.Slight 0.4630393 randomForest
43 House.Prices.Median.House.Price.....2010 0.4630416 randomForest
44 Mid.year.Estimate.totals.All.Ages.2003 0.4630724 randomForest
45 Age.Structure..2011.Census..45.64 0.4631397 randomForest
46 Country.of.Birth..2011..Not.United.Kingdom 0.4631927 randomForest
47 Religion..2011..Hindu.... 0.4632348 randomForest
48 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan 0.4632463 randomForest
49 Health..2011.Census..Very.good.health 0.4632530 randomForest
50 Road.Casualties.2012.Serious 0.4633609 randomForest
51 Country.of.Birth..2011..United.Kingdom.... 0.4633620 randomForest
52 Mid.year.Estimates.2012..by.age.60.64 0.4633652 randomForest
53 Car.or.van.availability..2011.Census..Cars.per.household 0.4633891 randomForest
54 Car.or.van.availability..2011.Census..1.car.or.van.in.household 0.4635527 randomForest
55 Country.of.Birth..2011..United.Kingdom 0.4635588 randomForest
56 Mid.year.Estimate.totals.All.Ages.2004 0.4636076 randomForest
57 Health..2011.Census..Day.to.day.activities.limited.a.little 0.4636114 randomForest
58 Ethnic.Group..2011.Census..BAME.... 0.4636206 randomForest
59 Mid.year.Estimates.2012..by.age.35.39 0.4636659 randomForest
60 Ethnic.Group..2011.Census..Other.ethnic.group 0.4636975 randomForest
61 Religion..2011..Jewish.... 0.4638245 randomForest
62 Mid.year.Estimates.2012..by.age.50.54 0.4638574 randomForest
63 Mid.year.Estimates.2012..by.age.5.9 0.4638771 randomForest
64 Household.Composition..2011..Numbers.Lone.parent.household 0.4638789 randomForest
65 Tenure..2011..Private.rented.... 0.4638809 randomForest
66 Health..2011.Census..Good.health 0.4639537 randomForest
67 Mid.year.Estimates.2012..by.age.15.19 0.4639768 randomForest
68 Economic.Activity..2011.Census..Economically.active..Total 0.4640246 randomForest
69 House.Prices.Median.House.Price.....2006 0.4640953 randomForest
70 Health..2011.Census..Very.bad.health 0.4641509 randomForest
71 Road.Casualties.2012.2012.Total 0.4641682 randomForest
72 House.Prices.Sales.2007 0.4642258 randomForest
73 Road.Casualties.2011.2011.Total 0.4643766 randomForest
74 Age.Structure..2011.Census..30.44 0.4643893 randomForest
75 Tenure..2011..Social.rented 0.4644427 randomForest
76 House.Prices.Median.House.Price.....2008 0.4644921 randomForest
77 Household.Language..2011..No.people.in.household.have.English.as.a.main.language 0.4644967 randomForest
78 Mid.year.Estimate.totals.All.Ages.2005 0.4645591 randomForest
79 Dwelling.type..2011..Flat..maisonette.or.apartment.... 0.4645668 randomForest
80 Car.or.van.availability..2011.Census..No.cars.or.vans.in.household.... 0.4646295 randomForest
81 Mid.year.Estimates.2012..by.age.45.49 0.4646569 randomForest
82 Car.or.van.availability..2011.Census..No.cars.or.vans.in.household 0.4646744 randomForest
83 House.Prices.Sales.2010 0.4647759 randomForest
84 Mid.year.Estimates.2012..by.age.65.69 0.4648435 randomForest
85 Qualifications..2011.Census..No.qualifications 0.4649392 randomForest
86 Mid.year.Estimates.2012..by.age.55.59 0.4649803 randomForest
87 Car.or.van.availability..2011.Census..4.or.more.cars.or.vans.in.household 0.4650008 randomForest
88 House.Prices.Median.House.Price.....2005 0.4650406 randomForest
89 Religion..2011..Buddhist 0.4650480 randomForest
90 Household.Composition..2011..Numbers.Couple.household.with.dependent.children 0.4650768 randomForest
91 Ethnic.Group..2011.Census..Other.ethnic.group.... 0.4651020 randomForest
92 House.Prices.Median.House.Price.....2012 0.4651438 randomForest
93 Mid.year.Estimates.2012..by.age.40.44 0.4651446 randomForest
94 Mid.year.Estimates.2012..by.age.85.89 0.4651512 randomForest
95 Household.Composition..2011..Numbers.One.person.household 0.4651658 randomForest
96 Mid.year.Estimates.2012..by.age.30.34 0.4651852 randomForest
97 Ethnic.Group..2011.Census..Asian.Asian.British.... 0.4651892 randomForest
98 Ethnic.Group..2011.Census..White 0.4652165 randomForest
99 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan.... 0.4652298 randomForest
100 Qualifications..2011.Census..Highest.level.of.qualification..Level.4.qualifications.and.above 0.4652353 randomForest
101 Mid.year.Estimates.2012..by.age.25.29 0.4653136 randomForest
102 Qualifications..2011.Census..Highest.level.of.qualification..Apprenticeship 0.4653622 randomForest
103 Ethnic.Group..2011.Census..White.... 0.4654519 randomForest
104 Car.or.van.availability..2011.Census..1.car.or.van.in.household.... 0.4656273 randomForest
105 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached 0.4656456 randomForest
106 Dwelling.type..2011..Household.spaces.with.no.usual.residents.... 0.4657004 randomForest
107 Household.Composition..2011..Numbers.Couple.household.without.dependent.children 0.4657572 randomForest
108 Lone.Parents..2011.Census..All.lone.parent.housholds.with.dependent.children 0.4657844 randomForest
109 Tenure..2011..Private.rented 0.4658002 randomForest
110 Religion..2011..No.religion.... 0.4660178 randomForest
111 House.Prices.Sales.2013.p. 0.4660392 randomForest
112 Tenure..2011..Owned..Owned.outright.... 0.4660447 randomForest
113 Dwelling.type..2011..Flat..maisonette.or.apartment 0.4661645 randomForest
114 PopDen 0.4662038 randomForest
115 Car.or.van.availability..2011.Census..3.cars.or.vans.in.household 0.4662422 randomForest
116 Ethnic.Group..2011.Census..BAME 0.4664845 randomForest
117 Religion..2011..No.religion 0.4665126 randomForest
118 House.Prices.Median.House.Price.....2009 0.4665186 randomForest
119 Health..2011.Census..Day.to.day.activities.limited.a.little.... 0.4665245 randomForest
120 House.Prices.Sales.2005 0.4668337 randomForest
121 Qualifications..2011.Census..Highest.level.of.qualification..Other.qualifications 0.4668340 randomForest
122 Economic.Activity..2011.Census..Economically.inactive..Total 0.4668704 randomForest
123 Religion..2011..Hindu 0.4670220 randomForest
124 Mid.year.Estimates.2012..by.age.90. 0.4670683 randomForest
125 Mid.year.Estimates.2012..by.age.10.14 0.4671924 randomForest
126 Lone.Parents..2011.Census..Lone.parents.not.in.employment 0.4672273 randomForest
127 Health..2011.Census..Very.good.health.... 0.4674076 randomForest
128 Dwelling.type..2011..Household.spaces.with.at.least.one.usual.resident.... 0.4674148 randomForest
129 House.Prices.Sales.2009 0.4674673 randomForest
130 Ethnic.Group..2011.Census..Asian.Asian.British 0.4674830 randomForest
131 Religion..2011..Religion.not.stated.... 0.4675539 randomForest
132 Household.Composition..2011..Percentages.Lone.parent.household 0.4676098 randomForest
133 Obesity.Percentage.of.the.population.aged.16..with.a.BMI.of.30...modelled.estimate..2006.2008 0.4676637 randomForest
134 Mid.year.Estimates.2012..by.age.80.84 0.4676930 randomForest
135 Religion..2011..Christian 0.4677124 randomForest
136 Population.Density.Persons.per.hectare..2012. 0.4677649 randomForest
137 Road.Casualties.2010.Slight 0.4679157 randomForest
138 Age.Structure..2011.Census..16.29 0.4679926 randomForest
139 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached.... 0.4680284 randomForest
140 Dwelling.type..2011..Whole.house.or.bungalow..Detached 0.4680587 randomForest
141 Religion..2011..Muslim 0.4680760 randomForest
142 House.Prices.Median.House.Price.....2013..p. 0.4680936 randomForest
143 Age.Structure..2011.Census..65. 0.4681371 randomForest
144 Tenure..2011..Social.rented.... 0.4681801 randomForest
145 Health..2011.Census..Bad.health 0.4683863 randomForest
146 Religion..2011..Sikh 0.4685185 randomForest
147 Household.Composition..2011..Percentages.One.person.household 0.4685197 randomForest
148 Qualifications..2011.Census..Highest.level.of.qualification..Level.3.qualifications 0.4686376 randomForest
149 Household.Composition..2011..Percentages.Other.household.Types 0.4686828 randomForest
150 Health..2011.Census..Good.health.... 0.4687342 randomForest
151 Ethnic.Group..2011.Census..Mixed.multiple.ethnic.groups 0.4687398 randomForest
152 Religion..2011..Other.religion 0.4687921 randomForest
153 Mid.year.Estimates.2012..by.age...15.64 0.4688589 randomForest
154 Road.Casualties.2010.2010.Total 0.4688710 randomForest
155 Health..2011.Census..Day.to.day.activities.not.limited.... 0.4688716 randomForest
156 Mid.year.Estimates.2012..by.age.20.24 0.4690163 randomForest
157 Mid.year.Estimate.totals.All.Ages.2002 0.4690649 randomForest
158 Dwelling.type..2011..Household.spaces.with.no.usual.residents 0.4690958 randomForest
159 Health..2011.Census..Bad.health.... 0.4691017 randomForest
160 Land.Area.Hectares 0.4691518 randomForest
161 Health..2011.Census..Very.bad.health.... 0.4692415 randomForest
162 Economic.Activity..2011.Census..Economically.active.. 0.4693216 randomForest
163 Mid.year.Estimates.2012..by.age.75.79 0.4693261 randomForest
164 House.Prices.Sales.2011...129 0.4694209 randomForest
165 Religion..2011..Christian.... 0.4694264 randomForest
166 Income.Deprivation..2010....of.people.aged.over.60.who.live.in.pension.credit.households 0.4695216 randomForest
167 House.Prices.Sales.2011...130 0.4695370 randomForest
168 Household.Composition..2011..Numbers.Other.household.Types 0.4695930 randomForest
169 Household.Income.Estimates..2011.12..Total.Mean.Annual.Household.Income.... 0.4698550 randomForest
170 Incidence.of.Cancer.All 0.4699335 randomForest
171 Health..2011.Census..Day.to.day.activities.limited.a.lot.... 0.4699552 randomForest
172 Household.Income.Estimates..2011.12..Total.Median.Annual.Household.Income.... 0.4699974 randomForest
173 Economic.Activity..2011.Census..Economically.inactive.. 0.4700493 randomForest
174 Qualifications..2011.Census..Schoolchildren.and.full.time.students..Age.18.and.over 0.4701093 randomForest
175 Dwelling.type..2011..Whole.house.or.bungalow..Terraced..including.end.terrace. 0.4703208 randomForest
176 Religion..2011..Religion.not.stated 0.4703725 randomForest
177 Road.Casualties.2012.Slight 0.4703863 randomForest
178 Life.Expectancy.Males 0.4703947 randomForest
179 Central.Heating..2011.Census..Households.with.central.heating.... 0.4704331 randomForest
180 Age.Structure..2011.Census..0.15 0.4704828 randomForest
181 House.Prices.Sales.2006 0.4710013 randomForest
182 Dwelling.type..2011..Whole.house.or.bungalow..Detached.... 0.4711949 randomForest
183 Income.Deprivation..2010....living.in.income.deprived.households.reliant.on.means.tested.benefit 0.4712269 randomForest
184 Religion..2011..Muslim.... 0.4717758 randomForest
185 Household.Composition..2011..Percentages.Couple.household.with.dependent.children 0.4719047 randomForest
186 Religion..2011..Jewish 0.4722026 randomForest
187 Mid.year.Estimates.2012..by.age...65. 0.4728008 randomForest
188 Life.Expectancy.Females 0.4730976 randomForest
189 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British 0.4734502 randomForest
190 Dwelling.type..2011..Whole.house.or.bungalow..Terraced..including.end.terrace..... 0.4744510 randomForest
191 Incidence.of.Cancer.Breast.Cancer 0.4746585 randomForest
192 Mid.year.Estimates.2012..by.age...0.to.14 0.4748548 randomForest
193 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British.... 0.4754857 randomForest
194 Lone.Parents..2011.Census..Lone.parent.not.in.employment.. 0.4760813 randomForest
195 Ethnic.Group..2011.Census..Mixed.multiple.ethnic.groups.... 0.4762080 randomForest
196 Health..2011.Census..Fair.health.... 0.4775044 randomForest
197 Economic.Activity..2011.Census..Economically.active..Unemployed 0.4788812 randomForest
198 Mid.year.Estimates.2012..by.age.0.4 0.4792534 randomForest
199 Adults.in.Employment..2011.Census....of.households.with.no.adults.in.employment..With.dependent.children 0.4802556 randomForest
200 Household.Composition..2011..Percentages.Couple.household.without.dependent.children 0.4813484 randomForest
201 Adults.in.Employment..2011.Census..No.adults.in.employment.in.household..With.dependent.children 0.4819720 randomForest
202 Mid.year.Estimates.2012..by.age.70.74 0.4822367 randomForest
203 Tenure..2011..Owned..Owned.outright 0.4843972 randomForest
204 House.Prices.Sales.2008 0.4856474 randomForest
205 Low.Birth.Weight.Births..2007.2011..UCL...Upper.confidence.limit 0.4862601 randomForest
206 Low.Birth.Weight.Births..2007.2011..Low.Birth.Weight.Births.... 0.4896779 randomForest
207 Low.Birth.Weight.Births..2007.2011..LCL...Lower.confidence.limit 0.4915561 randomForest
208 Economic.Activity..2011.Census..Unemployment.Rate 0.4977837 randomForest
209 ComEstRes 0.5692702 randomForest
210 _baseline_ 1.3392067 randomForest
model_parts <- model_parts(rf_explainer)
model_parts
variable mean_dropout_loss label
1 _full_model_ 0.4574597 randomForest
2 RPDBurglary 0.4574597 randomForest
3 Road.Casualties.2012.Fatal 0.4578200 randomForest
4 Road.Casualties.2011.Fatal 0.4578700 randomForest
5 Road.Casualties.2010.Fatal 0.4590274 randomForest
6 UsualRes 0.4593322 randomForest
7 Mid.year.Estimate.totals.All.Ages.2011 0.4593718 randomForest
8 Mid.year.Estimate.totals.All.Ages.2009 0.4596613 randomForest
9 HholdRes 0.4598399 randomForest
10 AvHholdSz 0.4602644 randomForest
11 Mid.year.Estimate.totals.All.Ages.2012 0.4602704 randomForest
12 Age.Structure..2011.Census..All.Ages 0.4602978 randomForest
13 Dwelling.type..2011..Household.spaces.with.at.least.one.usual.resident 0.4603912 randomForest
14 Health..2011.Census..Day.to.day.activities.not.limited 0.4604403 randomForest
15 Mid.year.Estimate.totals.All.Ages.2010 0.4604825 randomForest
16 Religion..2011..Other.religion.... 0.4605234 randomForest
17 Car.or.van.availability..2011.Census..4.or.more.cars.or.vans.in.household.... 0.4607462 randomForest
18 Religion..2011..Buddhist.... 0.4608544 randomForest
19 Mid.year.Estimate.totals.All.Ages.2006 0.4608830 randomForest
20 House.Prices.Median.House.Price.....2007 0.4610432 randomForest
21 Households..2011..All.Households 0.4614274 randomForest
22 Mid.year.Estimate.totals.All.Ages.2008 0.4615069 randomForest
23 Hholds 0.4615681 randomForest
24 Road.Casualties.2010.Serious 0.4616108 randomForest
25 House.Prices.Median.House.Price.....2011 0.4618015 randomForest
26 Car.or.van.availability..2011.Census..2.cars.or.vans.in.household.... 0.4619867 randomForest
27 Household.Language..2011..At.least.one.person.aged.16.and.over.in.household.has.English.as.a.main.language 0.4620052 randomForest
28 Road.Casualties.2011.Serious 0.4620302 randomForest
29 Country.of.Birth..2011..Not.United.Kingdom.... 0.4622100 randomForest
30 Health..2011.Census..Fair.health 0.4623212 randomForest
31 Health..2011.Census..Day.to.day.activities.limited.a.lot 0.4623938 randomForest
32 Car.or.van.availability..2011.Census..2.cars.or.vans.in.household 0.4624130 randomForest
33 Household.Language..2011....of.people.aged.16.and.over.in.household.have.English.as.a.main.language 0.4624304 randomForest
34 Age.Structure..2011.Census..Working.age 0.4624767 randomForest
35 Qualifications..2011.Census..Highest.level.of.qualification..Level.2.qualifications 0.4625207 randomForest
36 Mid.year.Estimate.totals.All.Ages.2007 0.4625307 randomForest
37 Car.or.van.availability..2011.Census..Sum.of.all.cars.or.vans.in.the.area 0.4625773 randomForest
38 Religion..2011..Sikh.... 0.4627246 randomForest
39 Car.or.van.availability..2011.Census..3.cars.or.vans.in.household.... 0.4627598 randomForest
40 Household.Language..2011....of.households.where.no.people.in.household.have.English.as.a.main.language 0.4627829 randomForest
41 Road.Casualties.2011.Slight 0.4629019 randomForest
42 Health..2011.Census..Very.good.health 0.4629458 randomForest
43 Religion..2011..Hindu.... 0.4630029 randomForest
44 House.Prices.Median.House.Price.....2010 0.4630116 randomForest
45 Mid.year.Estimate.totals.All.Ages.2003 0.4630255 randomForest
46 Age.Structure..2011.Census..45.64 0.4630483 randomForest
47 Qualifications..2011.Census..Highest.level.of.qualification..Level.1.qualifications 0.4630705 randomForest
48 Country.of.Birth..2011..Not.United.Kingdom 0.4631887 randomForest
49 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan 0.4632885 randomForest
50 Car.or.van.availability..2011.Census..Cars.per.household 0.4633223 randomForest
51 Mid.year.Estimates.2012..by.age.60.64 0.4633311 randomForest
52 Country.of.Birth..2011..United.Kingdom.... 0.4633968 randomForest
53 Road.Casualties.2012.Serious 0.4635364 randomForest
54 Mid.year.Estimate.totals.All.Ages.2004 0.4635498 randomForest
55 Car.or.van.availability..2011.Census..1.car.or.van.in.household 0.4635730 randomForest
56 Country.of.Birth..2011..United.Kingdom 0.4635925 randomForest
57 Health..2011.Census..Day.to.day.activities.limited.a.little 0.4636087 randomForest
58 Mid.year.Estimates.2012..by.age.35.39 0.4637321 randomForest
59 Mid.year.Estimates.2012..by.age.15.19 0.4637417 randomForest
60 Ethnic.Group..2011.Census..BAME.... 0.4637771 randomForest
61 Mid.year.Estimates.2012..by.age.5.9 0.4637791 randomForest
62 Ethnic.Group..2011.Census..Other.ethnic.group 0.4638037 randomForest
63 Religion..2011..Jewish.... 0.4638085 randomForest
64 Mid.year.Estimates.2012..by.age.50.54 0.4638372 randomForest
65 Tenure..2011..Private.rented.... 0.4638612 randomForest
66 Household.Composition..2011..Numbers.Lone.parent.household 0.4638954 randomForest
67 House.Prices.Median.House.Price.....2006 0.4639528 randomForest
68 Health..2011.Census..Good.health 0.4639818 randomForest
69 House.Prices.Sales.2007 0.4640659 randomForest
70 Economic.Activity..2011.Census..Economically.active..Total 0.4640923 randomForest
71 Road.Casualties.2012.2012.Total 0.4642396 randomForest
72 Household.Language..2011..No.people.in.household.have.English.as.a.main.language 0.4642865 randomForest
73 Health..2011.Census..Very.bad.health 0.4643521 randomForest
74 House.Prices.Median.House.Price.....2008 0.4644015 randomForest
75 Tenure..2011..Social.rented 0.4644247 randomForest
76 Road.Casualties.2011.2011.Total 0.4645202 randomForest
77 Mid.year.Estimates.2012..by.age.45.49 0.4645272 randomForest
78 Dwelling.type..2011..Flat..maisonette.or.apartment.... 0.4645396 randomForest
79 Mid.year.Estimate.totals.All.Ages.2005 0.4645396 randomForest
80 Car.or.van.availability..2011.Census..No.cars.or.vans.in.household 0.4645850 randomForest
81 Mid.year.Estimates.2012..by.age.65.69 0.4646241 randomForest
82 Age.Structure..2011.Census..30.44 0.4646842 randomForest
83 House.Prices.Sales.2010 0.4647005 randomForest
84 Mid.year.Estimates.2012..by.age.55.59 0.4647455 randomForest
85 Qualifications..2011.Census..No.qualifications 0.4647521 randomForest
86 Mid.year.Estimates.2012..by.age.85.89 0.4648053 randomForest
87 Ethnic.Group..2011.Census..Asian.Asian.British.... 0.4649362 randomForest
88 Car.or.van.availability..2011.Census..No.cars.or.vans.in.household.... 0.4649887 randomForest
89 Household.Composition..2011..Numbers.Couple.household.with.dependent.children 0.4650380 randomForest
90 House.Prices.Median.House.Price.....2005 0.4650593 randomForest
91 Ethnic.Group..2011.Census..Other.ethnic.group.... 0.4650595 randomForest
92 Mid.year.Estimates.2012..by.age.30.34 0.4650866 randomForest
93 Household.Composition..2011..Numbers.One.person.household 0.4651248 randomForest
94 Car.or.van.availability..2011.Census..4.or.more.cars.or.vans.in.household 0.4651832 randomForest
95 Religion..2011..Buddhist 0.4651846 randomForest
96 House.Prices.Median.House.Price.....2012 0.4652165 randomForest
97 Mid.year.Estimates.2012..by.age.25.29 0.4652273 randomForest
98 Mid.year.Estimates.2012..by.age.40.44 0.4653321 randomForest
99 Qualifications..2011.Census..Highest.level.of.qualification..Level.4.qualifications.and.above 0.4653572 randomForest
100 Ethnic.Group..2011.Census..White.... 0.4653608 randomForest
101 Ethnic.Group..2011.Census..White 0.4654119 randomForest
102 Tenure..2011..Owned..Owned.with.a.mortgage.or.loan.... 0.4654341 randomForest
103 Car.or.van.availability..2011.Census..1.car.or.van.in.household.... 0.4654860 randomForest
104 Dwelling.type..2011..Household.spaces.with.no.usual.residents.... 0.4655607 randomForest
105 Qualifications..2011.Census..Highest.level.of.qualification..Apprenticeship 0.4655982 randomForest
106 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached 0.4656553 randomForest
107 Household.Composition..2011..Numbers.Couple.household.without.dependent.children 0.4657463 randomForest
108 Tenure..2011..Private.rented 0.4658284 randomForest
109 Lone.Parents..2011.Census..All.lone.parent.housholds.with.dependent.children 0.4659055 randomForest
110 Religion..2011..No.religion.... 0.4659809 randomForest
111 House.Prices.Sales.2013.p. 0.4660846 randomForest
112 PopDen 0.4661753 randomForest
113 Tenure..2011..Owned..Owned.outright.... 0.4661914 randomForest
114 Dwelling.type..2011..Flat..maisonette.or.apartment 0.4662730 randomForest
115 Car.or.van.availability..2011.Census..3.cars.or.vans.in.household 0.4663467 randomForest
116 Economic.Activity..2011.Census..Economically.inactive..Total 0.4665490 randomForest
117 Religion..2011..No.religion 0.4665743 randomForest
118 Qualifications..2011.Census..Highest.level.of.qualification..Other.qualifications 0.4665861 randomForest
119 Ethnic.Group..2011.Census..BAME 0.4665951 randomForest
120 Mid.year.Estimates.2012..by.age.90. 0.4666822 randomForest
121 Health..2011.Census..Day.to.day.activities.limited.a.little.... 0.4666956 randomForest
122 House.Prices.Median.House.Price.....2009 0.4670617 randomForest
123 Mid.year.Estimates.2012..by.age.10.14 0.4670992 randomForest
124 Religion..2011..Hindu 0.4671744 randomForest
125 Lone.Parents..2011.Census..Lone.parents.not.in.employment 0.4672197 randomForest
126 Health..2011.Census..Very.good.health.... 0.4672697 randomForest
127 House.Prices.Sales.2005 0.4672761 randomForest
128 Dwelling.type..2011..Household.spaces.with.at.least.one.usual.resident.... 0.4672922 randomForest
129 Obesity.Percentage.of.the.population.aged.16..with.a.BMI.of.30...modelled.estimate..2006.2008 0.4674694 randomForest
130 Religion..2011..Religion.not.stated.... 0.4674761 randomForest
131 House.Prices.Sales.2009 0.4676134 randomForest
132 Ethnic.Group..2011.Census..Asian.Asian.British 0.4676269 randomForest
133 Religion..2011..Christian 0.4676932 randomForest
134 Age.Structure..2011.Census..16.29 0.4677159 randomForest
135 Population.Density.Persons.per.hectare..2012. 0.4677211 randomForest
136 Mid.year.Estimates.2012..by.age.80.84 0.4678363 randomForest
137 Dwelling.type..2011..Whole.house.or.bungalow..Semi.detached.... 0.4678430 randomForest
138 Household.Composition..2011..Percentages.Lone.parent.household 0.4678824 randomForest
139 Age.Structure..2011.Census..65. 0.4679577 randomForest
140 Dwelling.type..2011..Whole.house.or.bungalow..Detached 0.4679825 randomForest
141 Road.Casualties.2010.Slight 0.4680021 randomForest
142 Tenure..2011..Social.rented.... 0.4681448 randomForest
143 House.Prices.Median.House.Price.....2013..p. 0.4681463 randomForest
144 Health..2011.Census..Bad.health 0.4682243 randomForest
145 Household.Composition..2011..Percentages.One.person.household 0.4682857 randomForest
146 Religion..2011..Sikh 0.4683711 randomForest
147 Religion..2011..Muslim 0.4684244 randomForest
148 Mid.year.Estimates.2012..by.age.20.24 0.4687165 randomForest
149 Religion..2011..Other.religion 0.4687196 randomForest
150 Household.Composition..2011..Percentages.Other.household.Types 0.4687474 randomForest
151 Qualifications..2011.Census..Highest.level.of.qualification..Level.3.qualifications 0.4687744 randomForest
152 Health..2011.Census..Good.health.... 0.4689070 randomForest
153 Mid.year.Estimates.2012..by.age...15.64 0.4689236 randomForest
154 Land.Area.Hectares 0.4689759 randomForest
155 House.Prices.Sales.2011...130 0.4690715 randomForest
156 Road.Casualties.2010.2010.Total 0.4691619 randomForest
157 Health..2011.Census..Day.to.day.activities.not.limited.... 0.4692131 randomForest
158 Health..2011.Census..Bad.health.... 0.4692270 randomForest
159 Ethnic.Group..2011.Census..Mixed.multiple.ethnic.groups 0.4692414 randomForest
160 Dwelling.type..2011..Household.spaces.with.no.usual.residents 0.4692540 randomForest
161 Religion..2011..Christian.... 0.4692788 randomForest
162 Household.Composition..2011..Numbers.Other.household.Types 0.4692992 randomForest
163 Health..2011.Census..Very.bad.health.... 0.4693075 randomForest
164 Economic.Activity..2011.Census..Economically.active.. 0.4694227 randomForest
165 Mid.year.Estimate.totals.All.Ages.2002 0.4694391 randomForest
166 Mid.year.Estimates.2012..by.age.75.79 0.4694826 randomForest
167 House.Prices.Sales.2011...129 0.4695688 randomForest
168 Income.Deprivation..2010....of.people.aged.over.60.who.live.in.pension.credit.households 0.4697956 randomForest
169 Household.Income.Estimates..2011.12..Total.Mean.Annual.Household.Income.... 0.4699455 randomForest
170 Health..2011.Census..Day.to.day.activities.limited.a.lot.... 0.4699502 randomForest
171 Incidence.of.Cancer.All 0.4700787 randomForest
172 Economic.Activity..2011.Census..Economically.inactive.. 0.4701036 randomForest
173 Household.Income.Estimates..2011.12..Total.Median.Annual.Household.Income.... 0.4701785 randomForest
174 Life.Expectancy.Males 0.4702236 randomForest
175 Dwelling.type..2011..Whole.house.or.bungalow..Terraced..including.end.terrace. 0.4702649 randomForest
176 Qualifications..2011.Census..Schoolchildren.and.full.time.students..Age.18.and.over 0.4702706 randomForest
177 Central.Heating..2011.Census..Households.with.central.heating.... 0.4703402 randomForest
178 Road.Casualties.2012.Slight 0.4706094 randomForest
179 Religion..2011..Religion.not.stated 0.4706539 randomForest
180 Age.Structure..2011.Census..0.15 0.4707242 randomForest
181 House.Prices.Sales.2006 0.4712289 randomForest
182 Dwelling.type..2011..Whole.house.or.bungalow..Detached.... 0.4713404 randomForest
183 Income.Deprivation..2010....living.in.income.deprived.households.reliant.on.means.tested.benefit 0.4713521 randomForest
184 Household.Composition..2011..Percentages.Couple.household.with.dependent.children 0.4718472 randomForest
185 Religion..2011..Muslim.... 0.4722965 randomForest
186 Religion..2011..Jewish 0.4724288 randomForest
187 Life.Expectancy.Females 0.4726909 randomForest
188 Mid.year.Estimates.2012..by.age...65. 0.4731182 randomForest
189 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British 0.4739721 randomForest
190 Incidence.of.Cancer.Breast.Cancer 0.4744473 randomForest
191 Dwelling.type..2011..Whole.house.or.bungalow..Terraced..including.end.terrace..... 0.4750698 randomForest
192 Ethnic.Group..2011.Census..Black.African.Caribbean.Black.British.... 0.4754192 randomForest
193 Mid.year.Estimates.2012..by.age...0.to.14 0.4755040 randomForest
194 Ethnic.Group..2011.Census..Mixed.multiple.ethnic.groups.... 0.4760322 randomForest
195 Lone.Parents..2011.Census..Lone.parent.not.in.employment.. 0.4766007 randomForest
196 Health..2011.Census..Fair.health.... 0.4767855 randomForest
197 Economic.Activity..2011.Census..Economically.active..Unemployed 0.4783366 randomForest
198 Mid.year.Estimates.2012..by.age.0.4 0.4789625 randomForest
199 Adults.in.Employment..2011.Census....of.households.with.no.adults.in.employment..With.dependent.children 0.4798549 randomForest
200 Household.Composition..2011..Percentages.Couple.household.without.dependent.children 0.4814946 randomForest
201 Adults.in.Employment..2011.Census..No.adults.in.employment.in.household..With.dependent.children 0.4819823 randomForest
202 Mid.year.Estimates.2012..by.age.70.74 0.4825509 randomForest
203 Tenure..2011..Owned..Owned.outright 0.4842805 randomForest
204 House.Prices.Sales.2008 0.4857943 randomForest
205 Low.Birth.Weight.Births..2007.2011..UCL...Upper.confidence.limit 0.4877600 randomForest
206 Low.Birth.Weight.Births..2007.2011..LCL...Lower.confidence.limit 0.4906449 randomForest
207 Low.Birth.Weight.Births..2007.2011..Low.Birth.Weight.Births.... 0.4913146 randomForest
208 Economic.Activity..2011.Census..Unemployment.Rate 0.4975827 randomForest
209 ComEstRes 0.5668495 randomForest
210 _baseline_ 1.3432690 randomForest
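To read the table above: `_full_model_` is the loss on the untouched data, and each variable's row is the loss after that one column has been shuffled, so the gap to `_full_model_` is the variable's importance. Only `ComEstRes` (0.5668 against 0.4575) moves the loss much. `RPDBurglary` matches `_full_model_` exactly, which is what I'd expect if the model never reads the response column. Below is a minimal base R sketch of the same idea, not DALEX's actual implementation. The toy data, the `lm` model, the helper names (`rmse`, `dropout_loss`) and the choice of RMSE as the loss are all my assumptions for illustration.

```r
# Toy data: x1 matters a lot, x2 a little, noise not at all (all hypothetical).
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2 + noise, data = toy)

# Loss function assumed for this sketch: root mean squared error.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Loss on the untouched data: the analogue of the _full_model_ row.
full_loss <- rmse(toy$y, predict(fit, newdata = toy))

# Shuffle one column, re-score the model, and average over B shuffles.
dropout_loss <- function(variable, B = 10) {
  mean(replicate(B, {
    shuffled <- toy
    shuffled[[variable]] <- sample(shuffled[[variable]])
    rmse(toy$y, predict(fit, newdata = shuffled))
  }))
}

losses <- sapply(c("x1", "x2", "noise"), dropout_loss)

# The larger the increase over the full-model loss, the more the model leans on that variable.
print(full_loss)
print(sort(losses - full_loss, decreasing = TRUE))
```

Because the shuffles are random, the numbers change slightly between runs, which would also explain why two consecutive `model_parts()` printouts agree only to about three decimal places.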
plot(model_parts, max_vars=10)
pdp <- model_profile(rf_explainer)
plot(pdp, variables="ComEstRes")
plot(pdp, variables= "Economic.Activity..2011.Census..Unemployment.Rate")
plot(pdp, variables=c("ComEstRes", "Economic.Activity..2011.Census..Unemployment.Rate"))
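For intuition about what a partial dependence profile is, here is a hand-rolled version in base R: fix one variable at each grid value for every row, predict, and average. This is only a sketch of the idea on made-up data, with a hypothetical helper `partial_dependence` and an `lm` model standing in for the random forest. As far as I understand it, `model_profile()` does something similar on a sample of rows, but I haven't checked its internals.

```r
# Toy data and model (hypothetical, just to have something to profile).
set.seed(1)
n <- 300
toy <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
toy$y <- 2 * toy$x1 + sin(toy$x2) + rnorm(n, sd = 0.3)
fit <- lm(y ~ x1 + x2, data = toy)

# For each grid value: overwrite the variable in every row, predict, average.
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    data[[variable]] <- value
    mean(predict(model, newdata = data))
  })
}

grid <- seq(0, 10, length.out = 11)
pd_x1 <- partial_dependence(fit, toy, "x1", grid)

# One row per grid point: the average prediction with x1 held at that value.
print(data.frame(x1 = grid, average_prediction = pd_x1))
```

For a linear model the profile is a straight line whose slope is the fitted coefficient, which makes it an easy sanity check. For a random forest the same procedure can trace out steps and plateaus.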
pred <- predict(rf, newdata=test)
test$Predictions <- pred
test
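sqrt(sum(pred - test$RPDBurglary)^2) # absolute value of the summed errors, not the RMSE
[1] 2.651238
sqrt(mean((pred - test$RPDBurglary)^2)) # RMSE: root of the mean squared error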
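Two expressions that look alike but measure different things: squaring the sum of the errors versus averaging the squared errors. A tiny worked example with made-up numbers (the vectors below are hypothetical, not taken from the test set) shows how far apart they can be when positive and negative errors cancel.

```r
# Hypothetical observed and predicted values; every prediction is off by exactly 1.
observed  <- c(2, 4, 6, 8)
predicted <- c(3, 3, 7, 7)
errors <- predicted - observed   # 1, -1, 1, -1

# Summing first lets the errors cancel, so this is |sum of errors|.
abs_sum_error <- sqrt(sum(errors)^2)
print(abs_sum_error)   # 0

# Squaring first keeps every error positive; this is the RMSE.
rmse <- sqrt(mean(errors^2))
print(rmse)            # 1
```

The first quantity is still a handy check on overall bias, since a value near zero means over- and under-predictions roughly balance out. It just says nothing about how large a typical error is.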
ggplot(test, aes(x=RPDBurglary, y=Predictions)) + geom_point()